2026-01-04 · Updated 2026-07-15

LLM Fine-Tuning Guide: LoRA, QLoRA, DoRA, Unsloth, Axolotl, and Deployment

Most fine-tuning failures are decision failures. A team trains before proving that prompting, retrieval, or constrained decoding cannot solve the problem; evaluates on the training distribution; or discovers after training that the artifact is awkward to serve.

This guide treats adaptation as an experiment with an operational exit. It starts with the decision boundary, then follows one path through data, LoRA or QLoRA, task-specific evaluation, export, and serving.

TL;DR: Fine-tune when a measured baseline shows that changing model behavior is worth a weight update. Use retrieval for changing facts and constrained decoding for syntax. Start with LoRA when the base model fits comfortably; use QLoRA when frozen-weight memory is the constraint. Hold out an evaluation set before training, compare against the untuned baseline, and choose the deployment artifact before committing to a method.

Should you fine-tune at all?

Before you spend GPU hours, decide whether fine-tuning is the right tool for the problem in front of you.

Decision Flowchart

Fine-tuning vs RAG

Fine-tuning can change how a model uses domain language, but it is a poor update mechanism for facts that change or must be cited. Retrieval and fine-tuning solve different parts of the problem and often belong in the same system.

Feature	Fine-Tuning	RAG (Retrieval-Augmented Generation)
Core Function	Alters internal weights to teach skills, styles, or behaviors	Provides external, up-to-date context at inference time
Best For	• Specific conversational styles • Complex instruction following • Domain-specific reasoning	• Rapidly changing data (news, stock prices) • Reducing hallucinations (grounding) • Citing sources
Knowledge Handling	Changes statistical behavior in the weights; exact recall is not guaranteed	Retrieves records or passages that can be updated and cited
Update Frequency	Requires retraining for updates	Updates immediately with new documents

Fine-tuning vs prompt engineering

Modern LLMs respond well to clear prompts and examples. Test those options before you invest in fine-tuning.

Aspect	Fine-Tuning	Prompt Engineering
Setup Cost	High (data curation, GPU compute, iteration)	Low (iterative prompt refinement)
Flexibility	Requires another training and release cycle	Changes with the prompt
Format/Style	Can make a repeated behavior more likely	Often sufficient for style and simple formats
Latency	Can shorten repeated instructions	Depends on prompt length and provider caching
Best For	Complex behaviors, distillation, cost at scale	Rapid iteration, changing requirements

[!TIP] Try prompting first Start with a prompt and representative examples. If only the output syntax is unreliable, add constrained decoding before changing the weights.

Fine-tuning vs constrained decoding

Libraries such as xgrammar and outlines constrain generation to a JSON schema, regular expression, or grammar. Depending on the constraint and backend, they compile an automaton or grammar and mask invalid next tokens. No weight update is required.

This guarantees membership in the supported output language—not that the values are true, complete, or semantically appropriate. A syntactically valid function call can still contain the wrong customer ID.

Aspect	Constrained Decoding	Fine-Tuning
Setup	Immediate—define schema, deploy	Requires data curation, GPU compute, iteration
Guarantee	Valid syntax for the supported constraint	Learned behavior; schema compliance can vary
Flexibility	Change schema anytime without retraining	Locked after training
Latency	Small per-token masking overhead, depending on the grammar and backend	No decoding-time overhead, but the format is learned rather than guaranteed
Best For	JSON, choices, grammars, tool-call syntax	Repeated task behavior the base model lacks

A practical order:

Start with prompting and few-shot examples for basic formatting.
Add constrained decoding (xgrammar or outlines) when syntax is inconsistent.
Fine-tune only when you need behavioral changes that a schema can’t enforce.

Quick reference: matching problems to solutions

Challenge	First mechanism to test	Why?
Missing knowledge	RAG	Models hallucinate facts. Retrieval provides grounded, up-to-date context
Wrong format/tone	Prompt Engineering	Modern models follow style instructions well via few-shot examples
Invalid output syntax	Constrained decoding	Enforces a supported schema or grammar during generation
Repeated task failure	Fine-tuning (SFT)	Learns from curated input/output examples
Pairwise preference mismatch	Preference optimization	Uses chosen/rejected examples after the task behavior is measurable
Latency/cost at scale	Distillation (SFT)	Train smaller student model on larger teacher’s outputs
Reduce model size	Quantization	No training—compress weights (FP16→INT4) for faster inference

Make the business case measurable

Fine-tuning may reduce recurring prompt tokens or let a smaller model meet the target, but neither saving is automatic. Calculate the break-even point with your own traffic and pricing:

[ \text{break-even requests} = \frac{\text{training + evaluation + deployment cost}} {\text{baseline cost/request} - \text{tuned cost/request}} ]

If the denominator is small, negative, or based on an unproven quality assumption, the project does not yet have an economic case.
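The formula above can be sketched as a small helper. The function name and the cost figures are hypothetical; substitute your own measured costs in a single currency unit.

```python
def break_even_requests(fixed_cost, baseline_cost_per_request, tuned_cost_per_request):
    """Return the number of requests needed to recover fixed_cost.

    Returns None when the tuned path is not cheaper per request,
    because no request volume can then recover the fixed cost.
    """
    saving = baseline_cost_per_request - tuned_cost_per_request
    if saving <= 0:
        return None
    return fixed_cost / saving


# Hypothetical figures: training + evaluation + deployment cost, then per-request costs.
requests = break_even_requests(
    fixed_cost=12_000.0,
    baseline_cost_per_request=0.004,
    tuned_cost_per_request=0.001,
)
print(f"break-even after about {requests:,.0f} requests")

# A tuned model that costs more per request has no break-even point.
no_case = break_even_requests(12_000.0, 0.001, 0.004)
print(no_case)
```

The calculation only holds if the tuned model meets the same quality threshold as the baseline, so run it after the evaluation, not before.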

[!TIP] The hybrid setup A common architecture is a task-adapted smaller model plus retrieval for changing facts. Treat the larger prompted model as the baseline and keep the smaller model only if it meets the same task-specific quality and safety thresholds.

Types of fine-tuning

There are three main shapes fine-tuning can take. They differ in what kind of data they need and what they teach the model.

Fine-Tuning Types

1. Continued pre-training (unsupervised)

You train the base model on more raw text, with no labels. It keeps doing next-token prediction, just like during the original pre-training run.

When to use it:

The domain has vocabulary the base model never saw (medical, legal, internal codebases).
You have piles of domain text but no labeled (input, output) pairs.
The base model fumbles domain-specific terminology.

Example: training on millions of clinical notes so the model picks up medical abbreviations, drug names, and clinical workflows.

2. Supervised fine-tuning (SFT)

SFT trains on labeled (input, output) pairs. You show the model the exact output you want for each input.

When to use it:

You have a specific task with a clean input/output format.
You have quality labeled data, even in small amounts.
You need predictable behavior on a known shape of input.

Example: training on (SQL query description, SQL code) pairs for text-to-SQL.

{
    "input": "Get all users who signed up in the past month",
    "output": "SELECT * FROM users WHERE signup_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)"
}

3. Instruction tuning

Instruction tuning is a special case of SFT designed to make models follow a wide variety of natural-language instructions. The training data is (instruction, response) pairs across many different tasks.

When to use it:

You want a general-purpose assistant (like ChatGPT or Claude).
The model needs to handle varied, open-ended requests.
You’re building a chat interface.

Example: training on thousands of diverse instructions like “Summarize this article,” “Write a poem about X,” “Explain Y in simple terms.”

Comparison

Aspect	Continued Pre-training	SFT	Instruction Tuning
Data	Raw text	(input, output) pairs	(instruction, response) pairs
Labels	None (unsupervised)	Task-specific	Diverse tasks
Goal	Domain knowledge	Specific task behavior	Follow any instruction
Data Volume	Usually the largest corpus	Determined by task coverage and error diversity	Usually broader than task-specific SFT

[!NOTE] What people actually do SFT and instruction tuning use the same next-token objective; the distinction is the breadth and construction of the dataset. Continued pre-training is a separate experiment and should be followed by tests for both domain gains and general-capability regression.

The 7-stage fine-tuning pipeline

Fine-tuning is a pipeline, not a single command. Each stage has its own failure modes, and skipping one usually shows up as a bad model later.

7-Stage Pipeline

Each stage builds on the previous one:

Data preparation — Define the unit of evaluation, split the data, then clean and format it
Model selection — Choose the right base model and load weights
Training setup — Configure hardware, hyperparameters, and optimization strategy
Fine-tuning — Run SFT, DPO, or ORPO training
Evaluation — Benchmark performance and validate quality
Deployment — Export and serve your model
Monitoring — Track performance, maintain, and iterate

[!WARNING] Data is the foundation Training reproduces systematic flaws in the examples. Inspect labels, leakage, coverage, and policy compliance before spending time on optimizer sweeps.

Stage 1: Data preparation

Most fine-tuning projects fail here, not in training. Modern data prep is more than running a regex over CSVs.

Data Pipeline

The 5-stage data pipeline

Tools such as DataTrove and Distilabel can help at scale, but the pipeline should be driven by the failure taxonomy and data contract rather than by a preferred tool.

1. Ingestion and filtering

Action: drop refusals (“I cannot answer that”), broken UTF-8, and non-target languages.
Tools: Trafilatura for extraction, FastText for language ID.

2. Sensitive-data policy

Action: decide what the model is allowed to learn, then redact, tokenize, or exclude personal and confidential fields as required.
Tools: Microsoft Presidio or scrubadub.
Why: a detector is only one control; provenance, consent, retention, access, and deletion requirements still apply.

3. Deduplication (MinHash LSH)

Action: drop near-duplicates so the model does not memorize them.
Tools: DataTrove handles terabyte-scale processing well.
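For intuition, here is a minimal MinHash sketch using only the standard library. It is a teaching example, not DataTrove's implementation: the helper names, the word-trigram shingling, and the 64-hash signature size are assumptions. It also omits the LSH banding step that makes the comparison scale beyond pairwise checks.

```python
import hashlib


def shingles(text, n=3):
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "fine tuning adapters require careful evaluation on held out data"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated text scores near zero
```

Pick the similarity threshold by inspecting sampled pairs on your own data. A threshold that suits web text can delete legitimate variation in short task examples.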

4. Synthetic augmentation, if needed

Action: use a stronger teacher model (GPT-4o, DeepSeek-V3) to rewrite raw data into clean instruction-response pairs.
Tools: Distilabel.
Validation: sample teacher outputs, check them against the same rubric as human labels, and keep synthetic and human-written slices separate in evaluation.

5. Formatting

Action: convert to a standard format (Alpaca or ShareGPT).

Data format examples

Alpaca format (instruction-following):

{
    "instruction": "Summarize the following text.",
    "input": "The text to be summarized...",
    "output": "This is the summary."
}

ShareGPT format (conversational):

{
    "conversations": [
        { "from": "human", "value": "Hello, who are you?" },
        { "from": "gpt", "value": "I am a helpful AI assistant." }
    ]
}
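Many trainers and chat templates work from a list of role/content messages, so both formats usually get normalized to that shape at some point. A minimal sketch, assuming hypothetical helper names, a blank-line join between instruction and input, and a role map that accepts either ShareGPT-style or role-style names:

```python
def alpaca_to_messages(record):
    """Convert one Alpaca-style record to role/content chat messages."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are joined with a blank line.
        prompt = f"{prompt}\n\n{record['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]


# Assumed mapping; extend it if your data uses other speaker names.
ROLE_MAP = {
    "human": "user",
    "gpt": "assistant",
    "user": "user",
    "assistant": "assistant",
    "system": "system",
}


def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record to role/content chat messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


alpaca = {
    "instruction": "Summarize the following text.",
    "input": "The text to be summarized...",
    "output": "This is the summary.",
}
sharegpt = {
    "conversations": [
        {"from": "human", "value": "Hello, who are you?"},
        {"from": "gpt", "value": "I am a helpful AI assistant."},
    ]
}

print(alpaca_to_messages(alpaca))
print(sharegpt_to_messages(sharegpt))
```

An unknown speaker name raises a `KeyError` here, which is deliberate: silently dropping or guessing a role corrupts the training signal.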

What actually matters

Coverage before volume. Add examples that represent distinct failure modes, not repetitions of the easy majority case.
Cleanliness. Drop irrelevant text, normalize whitespace, keep formatting consistent.
Balance. Preserve important rare cases and report performance per slice.
Separation. Split by source, user, document, or time when random row splits would leak near-duplicates.
Provenance. Record the source, license or permission, transformation history, and deletion path for every dataset version.
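The separation rule can be sketched as a deterministic, group-aware split. The `assign_split` helper, the `user_id` field, and the 10% evaluation fraction are illustrative assumptions. The idea is that every row sharing a group key lands in the same split, and that the assignment stays stable across reruns.

```python
import hashlib


def assign_split(group_key, eval_fraction=0.1):
    """Map a group key (user, document, source) to 'train' or 'eval'.

    Hashing the key instead of sampling rows keeps all rows from one
    group together and makes the split reproducible.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


rows = [
    {"user_id": "u1", "text": "reset my password"},
    {"user_id": "u1", "text": "reset my password please"},
    {"user_id": "u2", "text": "cancel my order"},
    {"user_id": "u3", "text": "where is my invoice"},
]

splits = {"train": [], "eval": []}
for row in rows:
    splits[assign_split(row["user_id"])].append(row)

print({name: len(items) for name, items in splits.items()})
```

For time-based leakage, split on a cutoff date instead of a hash. With few groups, the realized evaluation fraction can differ noticeably from the target, so check the counts.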

Stage 2: Model selection and hardware

Picking the base model and understanding the GPU floor decide what you can actually train.

Start with the smallest base model that already clears the non-negotiable baseline checks. Confirm:

license and redistribution terms for the intended product;
language, domain, tool-use, and safety behavior before adaptation;
tokenizer and chat-template compatibility with the dataset;
maximum context and truncation behavior needed by real examples;
support in both the training framework and target serving engine.

Fine-tuning is an adaptation step, not a repair for an unsuitable base. If the model fails capabilities the dataset does not cover, choose another base before adding more data or epochs.

Size the run, not the marketing tier

There is no durable “model size → GPU” table. Peak memory changes with weight precision, optimizer, trainable parameter count, sequence length, micro-batch size, activation checkpointing, attention implementation, and framework overhead. Start with a memory estimate, then run a short maximum-length smoke test on the exact stack.

Memory component	Full fine-tuning	LoRA	QLoRA
Base weights	Training precision	Frozen, usually BF16/FP16	Frozen, typically 4-bit NF4
Gradients	All trainable weights	Adapter weights	Adapter weights
Optimizer states	All trainable weights	Adapter weights	Adapter weights
Activations	Depends on batch and sequence length in every method	Same dependency	Same dependency

The original QLoRA paper fit a 65B LLaMA model on one 48 GB GPU in its specific setup. That result is a useful bound, not a promise that every current 70B architecture, context length, kernel, or trainer will fit the same device.

Memory math

For a model with (P) parameters, weights alone require approximately (2P) bytes in BF16/FP16 or (0.5P) bytes at four bits, before quantization metadata and runtime buffers. Full Adam-style training adds gradients, optimizer states, and often higher-precision master weights. LoRA avoids most trainable-state memory; QLoRA additionally reduces the frozen base-weight footprint. Activations can still dominate at long sequence lengths.
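A back-of-the-envelope version of that arithmetic, as a sketch. The 8B parameter count is hypothetical. The 16 bytes per parameter for full Adam training is a common mixed-precision accounting (BF16 weights and gradients, FP32 master weights, two FP32 Adam moments), not a measurement of any particular trainer. Activations, quantization metadata, and runtime buffers are deliberately left out.

```python
GIB = 1024**3


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given precision."""
    return num_params * bits_per_weight / 8


params = 8e9  # hypothetical 8B-parameter model

bf16_gib = weight_bytes(params, 16) / GIB      # about 2P bytes
four_bit_gib = weight_bytes(params, 4) / GIB   # about 0.5P bytes

# Assumed accounting for full mixed-precision Adam training, per parameter:
# 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master) + 4 + 4 (Adam moments).
full_adam_bytes_per_param = 2 + 2 + 4 + 4 + 4
full_adam_gib = params * full_adam_bytes_per_param / GIB

print(f"BF16 weights:        {bf16_gib:6.1f} GiB")
print(f"4-bit weights:       {four_bit_gib:6.1f} GiB")
print(f"Full Adam state:     {full_adam_gib:6.1f} GiB (before activations)")
```

Treat these numbers as a floor for planning. The maximum-length smoke test on the real stack is what decides whether the run fits.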

Use this workflow:

Choose the longest sequence and micro-batch you must support.
Estimate weights and trainable state, leaving headroom for activations and kernels.
Run one forward/backward step at maximum length.
Record peak allocated and reserved memory.
Only then scale batch size, rank, sequence length, or GPU count.

Stage 3: Training methods (PEFT and LoRA)

Full fine-tuning vs PEFT

Full fine-tuning (FFT) updates every weight, so gradients and optimizer state scale with the whole model. The peak cannot be inferred from parameter count alone, but it is far above the memory needed to load the weights for inference.

Parameter-efficient fine-tuning (PEFT) trains only a small subset of parameters and freezes the rest, so gradients and optimizer state scale with the trainable subset rather than with the whole model.

LoRA: the starting point

LoRA (Low-Rank Adaptation) freezes a pre-trained matrix and represents its learned update with two smaller matrices. The original paper motivates this with the hypothesis that useful adaptation updates have low intrinsic rank.

LoRA Architecture

For a frozen matrix (W_0 \in \mathbb{R}^{d_{out} \times d_{in}}), LoRA learns:

(A \in \mathbb{R}^{r \times d_{in}})
(B \in \mathbb{R}^{d_{out} \times r})

The adapted layer is:

[ W' = W_0 + \frac{\alpha}{r}BA ]

The adapter has (r(d_{in}+d_{out})) trainable parameters instead of (d_{in}d_{out}) for that matrix. For a square 4,096-wide matrix at rank 16, that is a 128× reduction for the matrix, not 10,000× for an arbitrary model. The LoRA paper’s 10,000× headline was a specific GPT-3 175B configuration that adapted selected matrices.
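The parameter count and the merge identity can be checked with a few lines of plain Python. The tiny matrices below are made-up values chosen so the arithmetic is easy to follow. A real adapter would use a tensor library.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full matrix parameters, LoRA adapter parameters)."""
    return d_in * d_out, r * (d_in + d_out)


full, adapter = lora_param_counts(d_in=4096, d_out=4096, r=16)
reduction = full / adapter
print(full, adapter, reduction)


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


# Toy shapes: d_out = 2, d_in = 3, r = 1.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # r x d_in
B = [[2.0], [1.0]]       # d_out x r
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 2.0, 3.0]

# Adapter kept separate: W0 x + (alpha / r) * B (A x)
unmerged = [b + scale * d for b, d in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Adapter merged into the weights: (W0 + (alpha / r) * B A) x
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, which is why a LoRA adapter can be served separately or folded into the base weights without changing the function it computes, up to floating-point rounding at the chosen precision.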

PEFT methods compared

Method	What changes	Choose it when
LoRA	Frozen base plus low-rank trainable updates	The base model fits comfortably and you want small task artifacts
QLoRA	LoRA with the frozen base stored in 4-bit form	Base-weight memory is the limiting factor
DoRA	Separates weight magnitude from a LoRA-updated direction	A measured LoRA baseline leaves a quality gap worth extra complexity
Full fine-tuning	All model weights	PEFT misses the target and the quality gain justifies distributed training and full checkpoints

When to pick which

LoRA: start here. Fast, memory-efficient, well-supported.
QLoRA: when the same LoRA experiment does not fit because of frozen base weights.
DoRA: after an apples-to-apples LoRA comparison shows a useful gain.
Full fine-tuning: only after PEFT is an evaluated bottleneck rather than an assumption.

DoRA: weight-decomposed LoRA

DoRA (Weight-Decomposed Low-Rank Adaptation) separates the magnitude of each weight vector from its direction. It applies a LoRA update to the directional component while learning the magnitude separately.

DoRA Architecture

How it works:

Instead of treating weights as a single entity, DoRA breaks pre-trained weights into two components:

Magnitude — a trainable value per weight vector.
Direction — a normalized vector updated through low-rank matrices.

In compact column-wise notation:

[ W' = m \frac{V + BA}{\lVert V + BA \rVert_c} ]

where:

m = magnitude (trainable)
(V) = the frozen directional matrix
(BA) = the learned low-rank directional update
(\lVert \cdot \rVert_c) = column-wise normalization

What you get for that extra structure:

More degrees of freedom than standard LoRA because magnitude can change independently.
Better results than LoRA in several configurations reported by the proposing paper.
Extra parameters and computation, so the gain should be verified on your task and serving path.
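A small numeric sketch of the decomposition in plain Python. The matrices are made-up values. Initializing the magnitude to the column norms of the pre-trained weight follows my reading of the DoRA paper, so verify it against the implementation you use.

```python
import math


def column_norms(M):
    """L2 norm of each column of a row-major matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in M)) for j in range(len(M[0]))]


def dora_weight(V, BA, m):
    """W' = m * (V + BA) / ||V + BA||_c, with the norm taken per column."""
    D = [[v + d for v, d in zip(v_row, d_row)] for v_row, d_row in zip(V, BA)]
    norms = column_norms(D)
    return [[m[j] * D[i][j] / norms[j] for j in range(len(norms))] for i in range(len(D))]


V = [[3.0, 0.0], [4.0, 2.0]]        # frozen directional matrix
BA = [[0.0, 0.0], [0.0, 0.0]]       # zero low-rank update, as before any training
m = column_norms(V)                 # magnitude initialized to the column norms

W_init = dora_weight(V, BA, m)      # reproduces V exactly
W_scaled = dora_weight(V, BA, [10.0, 2.0])  # only the first column's magnitude changes

print(m)
print(W_init)
print(W_scaled)
```

Changing `m` rescales a column without touching its direction, and changing `BA` rotates the direction without changing the column norm. That independence is the extra degree of freedom the section describes.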

Adapter merging for multi-task learning

Separate adapters let one frozen base support several tasks. You can route requests to an adapter, serve multiple adapters from one engine when supported, or create an offline merged candidate. Merging can introduce interference, so evaluate the merged artifact rather than assuming the source adapters compose cleanly.

Common merging methods:

Concatenation — combine adapter parameters and grow the effective rank. Fast and simple.
Linear combination — weighted sum of adapters. Gives you knobs.
SVD — matrix decomposition for merging. More flexible, but slower.

Example: one adapter for summarization, another for translation, merged into a single multi-task model.

Stage 4: Fine-tuning and preference alignment

SFT learns demonstrations. Preference optimization instead learns from comparisons such as “chosen response A is better than rejected response B.” Use it only when pairwise preference is the right label for the error; factual correctness and policy compliance often need stronger evaluators than a global preference.

Alignment Methods

PPO-based RLHF

The original recipe was a three-stage pipeline:

SFT — learn the task.
Reward model — train on human preferences (chosen vs rejected).
PPO (Proximal Policy Optimization) — reinforcement learning to optimize the policy.

The operational cost comes from the moving parts:

Complicated to implement and maintain.
Expensive — you train multiple models.
On-policy sampling and reward optimization require careful stability and reward-hacking checks.

DPO

DPO (Direct Preference Optimization) drops the explicit reward model and the RL loop. It still optimizes the RLHF objective (reward maximization with a KL-divergence constraint), but it does so as a reparameterized supervised learning problem instead of with reinforcement learning. Each training record pairs a prompt with a preferred (chosen) and a non-preferred (rejected) response:

{
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses qubits...",
    "rejected": "Well, it's complicated..."
}

What changes operationally:

Simpler code path (no separate reward model, no RL loop).
An offline objective over preference pairs rather than on-policy reinforcement learning.
A reference policy or equivalent reference log probabilities in the standard formulation.

DPO is easier to prototype than a full PPO pipeline, but it is not an automatic quality upgrade. Results depend on the starting policy, pair quality, loss settings, length effects, and evaluation protocol. Compare it with an SFT checkpoint on the same held-out preference and task suites.
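For reference, the per-pair DPO loss can be written in a few lines. This is a scalar sketch of the standard formulation, not TRL's implementation: the function names are mine, and the log-probabilities are assumed to be summed over response tokens by whatever code calls it.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)  # identical to -log(sigmoid(margin))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After training, the policy raises the chosen response relative to the reference.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

The reference log-probabilities are what tie the policy to its starting point, which is why the standard formulation needs a reference model or precomputed reference log-probabilities.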

ORPO

ORPO (Odds-Ratio Preference Optimization) combines the SFT negative-log-likelihood loss with an odds-ratio penalty on rejected responses. It removes the separate reference model and can combine task learning and preference optimization in one run.

How it works: ORPO uses a combined loss that does two things at once:

Maximizes the likelihood of the chosen response (learning the task).
Penalizes the rejected response with an odds-ratio term (learning preferences).

Hyperparameters worth knowing:

from trl import ORPOConfig

config = ORPOConfig(
    learning_rate=8e-6,  # Low value from the ORPO paper's experiments; tune for your setup
    beta=0.1,            # Controls strength of preference penalty
    # ... other params
)

Learning rate: the paper used low rates in its experiments; tune against your model, batch, and data rather than copying one value as a rule.
Beta: controls the preference term relative to the SFT term.
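The combined objective can be sketched for a single example as follows. This is a scalar illustration of the loss described above, not TRL's code: the function names are mine, and the inputs are assumed to be length-averaged log-probabilities of the chosen and rejected responses.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp), with avg_logp strictly below 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, beta=0.1):
    """SFT negative log-likelihood plus a weighted odds-ratio penalty."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    preference = softplus(-log_odds_ratio)  # -log(sigmoid(log odds ratio))
    return nll + beta * preference


# Equal likelihoods: the preference term is log(2), scaled by beta.
loss_equal = orpo_loss(-1.0, -1.0)

# Chosen response more likely than rejected: both terms shrink.
loss_better = orpo_loss(-0.5, -2.0)

print(loss_equal, loss_better)
```

No reference model appears anywhere in the loss. That is the operational difference from DPO, and also why there is no anchor to the starting policy other than the SFT term.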

The trade-off:

One training stage instead of two.
No reward model.
No reference-model forward pass.
A coupled run: if task learning or preference behavior regresses, there is no intermediate SFT checkpoint from the same pipeline to inspect.

Choose through the data and evaluation design:

Use DPO when you already have a satisfactory SFT checkpoint and want a simpler offline preference experiment.
Test ORPO when a reference-free, single-stage objective fits your data and operational constraints.
Use PPO-based RLHF when online sampling against an explicit learned reward is part of the requirement and you can monitor reward exploitation.

None is a default across tasks. Keep an SFT-only baseline and report both task metrics and preference metrics.

Fine-tuning frameworks

Frameworks overlap and change quickly. Pick by the execution path you must support, pin versions, and keep the training configuration portable enough to reproduce outside a notebook.

Unsloth — speed and memory efficiency

Unsloth integrates with Hugging Face trl and transformers and provides optimized kernels, checkpointing, and quantized fine-tuning paths for supported models.

Custom Triton GPU kernels for attention, RoPE, and cross-entropy that skip PyTorch overhead.
Memory-efficient backprop that recomputes activations during the backward pass instead of holding them in memory.
Fused operations that collapse multiple steps (layer norm + linear, and similar) into single GPU calls.
4-bit quantization integrated directly into the QLoRA path with optimized dequantization.

[!IMPORTANT] Import order matters Follow the import order in the Unsloth example for the version you pin. Unsloth applies patches during import, so importing it before trl and transformers avoids missing optimizations or version-specific errors.

# ✅ Correct order
from unsloth import FastLanguageModel  # Must be first!
from trl import SFTTrainer
from transformers import TrainingArguments

# Avoid this order with Unsloth's patched path
from trl import SFTTrainer
from unsloth import FastLanguageModel

Best for: single-GPU training, prototyping, Colab notebooks, anyone watching the GPU bill.

Published speed and memory numbers vary by model, sequence length, batch, precision, and hardware. Benchmark tokens per second and peak memory on your own run instead of treating a headline ratio as a framework property.

Axolotl — config-driven training

# config.yaml - no code required
base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
datasets:
    - path: data/my_data.jsonl
      type: alpaca
sample_packing: true

Run with: axolotl train config.yaml (older releases use accelerate launch -m axolotl.cli.train config.yaml; check the launcher for the version you pin)

Best for: declarative configurations that can be reviewed, versioned, and reused across local and distributed runs, with built-in launcher options for distributed training.

Framework comparison

Tool	Useful when	Verify before committing
Unsloth	You want an optimized supported-model path with concise examples	Model, GPU, quantization, and distributed-support matrix
Axolotl	You want declarative configs and built-in distributed recipes	Exact config schema and launcher for the pinned release
TRL	You want direct access to Hugging Face SFT and preference trainers	Dataset format, chat template, loss masking, and PEFT integration
Torchtune	You want PyTorch-native recipes and components	Model recipe coverage and export compatibility

Practical demo: fine-tuning with Unsloth

Here’s a complete example from my unsloth-finetune-demo repository. The demo fine-tunes Nemotron-Nano for function calling.

Training Pipeline

Quick start

# Clone and setup
git clone https://github.com/slavadubrov/unsloth-finetune-demo.git
cd unsloth-finetune-demo

# Install with uv (recommended)
uv sync

# Run fine-tuning (quick test)
uv run finetune --max-samples 1000

Configuration

The interesting parts live in config.py:

# Model & Dataset
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"  # 4B params, 128K context
DATASET_NAME = "glaiveai/glaive-function-calling-v2"   # 113K examples

# LoRA Configuration
LORA_R = 16        # Adapter capacity; tune against held-out results
LORA_ALPHA = 32    # Update scaling; alpha/r is the classic LoRA scale
MAX_SEQ_LENGTH = 4096

# Candidate modules for this Llama-family model
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

[!NOTE] The alpha-to-rank ratio alpha/r scales the classic LoRA update. alpha = 2r is a common starting heuristic in some tool documentation, not a stability guarantee. Sweep rank, alpha, learning rate, and target modules only after the data and baseline are fixed.

Core training code

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Lower activation memory; extra compute
)

# The data step creates versioned train_dataset and eval_dataset objects.
# Each row has the chat messages and tool schemas expected by current TRL.

# Train with the current TRL configuration surface.
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=SFTConfig(
        output_dir="outputs/nemotron-function-calling",
        max_length=4096,
        packing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=3,
        bf16=True,
    ),
)
trainer.train()

Fine-tuning with Axolotl

[!NOTE] Demo coming I’m working on a hands-on Axolotl demo. Until then, the Accelerate n-D Parallelism Guide from Hugging Face is a good reference for multi-GPU training strategies.

For config-first and distributed setups, Axolotl makes the workflow reproducible:

# axolotl_config.yaml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM

# QLoRA configuration
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

# Dataset
datasets:
    - path: data/training_data.jsonl
      type: alpaca

# Training settings
sequence_len: 4096
sample_packing: true # Benchmark with your length distribution
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.0002
num_epochs: 3

# Precision and attention path; verify support on the pinned stack
bf16: true
flash_attention: true

Run training:

axolotl train axolotl_config.yaml

Stage 5: Evaluation

Freeze the evaluation contract before the first run. At minimum, compare the tuned checkpoint with the exact untuned base under the same prompt, decoding settings, and tool environment. Report aggregate quality only after checking the failure slices the project was meant to improve.

Track four groups:

Target task: exact match, execution success, human rubric, or another outcome tied to the use case.
Regression: general abilities and previously supported task slices that adaptation could damage.
Safety and policy: refusals, data leakage, prompt injection, or domain-specific constraints.
Operations: latency, throughput, memory, artifact size, and cost at the intended serving configuration.

Automated benchmarks

Use lm-evaluation-harness for relevant standardized tasks, not as a substitute for the product evaluation:

lm_eval --model hf \
    --model_args pretrained=./outputs/merged-model \
    --tasks hellaswag,arc_easy,mmlu \
    --batch_size 8

LLM-as-judge

For subjective quality, a larger model can assist with scoring, but calibrate it against human-reviewed examples and keep the candidate identity hidden:

judge_prompt = """
Rate this response from 1-5 on:
- Relevance
- Accuracy
- Formatting

Response: {model_output}
Expected: {ground_truth}
"""

Domain-specific evaluation

Hold out real examples by source, user, document, or time so near-duplicates cannot leak across the split. For function calling, validate the full trajectory: tool selection, arguments, execution result, recovery, and final response. Report confidence intervals or paired win/loss counts when the sample is small, and inspect every regression in a critical slice.
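For small evaluation sets, paired win/loss counts with an exact sign test are easy to compute without extra dependencies. The per-example scores below are made up, and the helper names are mine. Each position is the same held-out example scored for the base and the tuned model.

```python
from math import comb


def paired_counts(base_scores, tuned_scores):
    """Count examples where the tuned model wins, loses, or ties."""
    wins = sum(t > b for b, t in zip(base_scores, tuned_scores))
    losses = sum(t < b for b, t in zip(base_scores, tuned_scores))
    ties = len(base_scores) - wins - losses
    return wins, losses, ties


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test; ties are excluded."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical pass/fail results on ten held-out function-calling examples.
base = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
tuned = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

wins, losses, ties = paired_counts(base, tuned)
p_value = sign_test_p_value(wins, losses)
print(f"wins={wins} losses={losses} ties={ties} p={p_value}")
```

Five wins and no losses still gives p = 0.0625, a reminder that ten examples cannot support a strong claim. Report the counts alongside any aggregate, and read every loss in a critical slice individually.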

Stage 6: Deployment and output formats

Choose the artifact based on the serving engine and rollback plan, not only file size:

Output Formats

1. LoRA adapter

uv run finetune  # Saves ~100-500MB adapter

Size: proportional to targeted modules, rank, layers, and dtype; often far smaller than the base.
Best for: development, versioned task adapters, and engines that support LoRA directly.
Bonus: you can swap adapters without re-downloading the base model.

2. Merged model

uv run finetune --merge  # Creates a standalone full model

Size: approximately the full base checkpoint at the chosen output precision.
Best for: engines or distribution paths that do not support the adapter separately.
Trade-off: larger artifact and slower rollout; simpler single-model loading.

3. GGUF format

uv run finetune --gguf q4_k_m  # Creates ~2-4GB quantized model

Size: model-dependent; roughly four-bit weights plus metadata for Q4 variants.
Best for: CPU inference, Ollama, llama.cpp, edge deployment.
Options: q4_k_m (smaller), q5_k_m (more weight fidelity), q8_0 (larger and higher fidelity). Measure the task impact after conversion.

Stage 7: Serving and monitoring

With vLLM

# Serve the base and expose a PEFT adapter as a model name.
vllm serve nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 \
    --enable-lora \
    --lora-modules function-calling=./outputs/adapter \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096

Query via the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="function-calling",  # Adapter name registered with --lora-modules
    messages=[{"role": "user", "content": "Book a flight to Tokyo"}],
)
print(response.choices[0].message.content)

With Ollama (local)

# Create Modelfile
echo 'FROM ./outputs/unsloth-nemotron-function-calling-gguf/model-q4_k_m.gguf' > Modelfile

# Import to Ollama
ollama create my-function-model -f Modelfile

# Run
ollama run my-function-model

With llama.cpp (CPU)

./llama-cli -m ./outputs/model-q4_k_m.gguf \
    -p "What's the weather in Tokyo?" \
    --ctx-size 4096

Monitor the released model

The lifecycle does not stop at a healthy training loss. Record the base-model revision, tokenizer and chat template, adapter hash, dataset version, training configuration, and evaluation report as one release unit. In production, monitor task success, invalid outputs, policy failures, latency, and input drift by the same slices used offline. Keep the previous artifact loadable and define a rollback threshold before launch.
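One way to treat those items as a single release unit is a small manifest with a content-derived identifier. This is a sketch: the `ReleaseManifest` class, its field names, and the example values are hypothetical, and a real pipeline would compute the hashes from the actual files.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    base_model_revision: str
    tokenizer_revision: str
    chat_template_sha256: str
    adapter_sha256: str
    dataset_version: str
    training_config_sha256: str
    eval_report_path: str

    def release_id(self):
        """Short identifier derived from every field, so any change yields a new id."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


# Placeholder values for illustration only.
manifest = ReleaseManifest(
    base_model_revision="example-base-rev",
    tokenizer_revision="example-tokenizer-rev",
    chat_template_sha256="aaaa",
    adapter_sha256="bbbb",
    dataset_version="v3",
    training_config_sha256="cccc",
    eval_report_path="reports/eval-v3.json",
)
release_id = manifest.release_id()
print(release_id)
```

Logging the release id with every production request makes it possible to attribute a regression to a specific artifact and to confirm that a rollback actually restored the previous one.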

Key takeaways

Fine-tune only after an untuned baseline and failure taxonomy show that weight adaptation addresses the problem.
Retrieval manages changing evidence; constrained decoding manages syntax; neither is replaced by SFT.
LoRA reduces trainable state. QLoRA additionally compresses frozen base weights. Do not attribute QLoRA memory figures to LoRA.
Data coverage, split integrity, provenance, and loss masking matter more than copying a fashionable optimizer configuration.
DPO, ORPO, and PPO-based RLHF are different experimental designs, not a quality ladder with a universal default.
Evaluate target behavior, regressions, safety, and operations against the same base model.
Pick adapter, merged, or GGUF output from the serving and rollback requirements before training.

References

Data processing tools

DataTrove — Hugging Face data processing at scale
Distilabel — Synthetic data generation (Argilla)
Trafilatura — Web text extraction and crawling
FastText — Facebook AI language identification (supports 217 languages)
Microsoft Presidio — PII detection and anonymization
scrubadub — Python library for PII removal

Constrained decoding

xgrammar — Constrained decoding with FSMs
outlines — Structured generation for LLMs

Training frameworks

Unsloth — Optimized fine-tuning framework
Axolotl — Config-driven training and distributed launchers
TRL — Hugging Face SFT and preference-training library
Torchtune — PyTorch-native fine-tuning library

Inference and deployment

vLLM LoRA adapters — Serve one or more adapters with the base model
Ollama — Local LLM runner for Mac/Windows/Linux
llama.cpp — CPU/GPU inference with GGUF format

Evaluation

lm-evaluation-harness — EleutherAI standardized LLM benchmarking

Guides and resources

Demo Repository — Practical fine-tuning example
LLM Fine-Tuning. Theoretical Intuition and Practical Implementation — NotebookLM research notebook
Accelerate n-D Parallelism Guide — Hugging Face multi-GPU training strategies