Run LLM Inference on Raspberry Pi 5 Offline: Model Pruning, Quantization, and Deployment Patterns

webdecodes
2026-02-08
10 min read

Practical walkthrough to run quantized LLMs on Raspberry Pi 5 + AI HAT+ 2—pruning, ONNX int8 toolchain, memory trade-offs, and micro-app patterns.

Ship offline LLM features on Raspberry Pi 5 without losing sleep

If you’re a developer or IT pro trying to run a practical LLM on-device, you already know the two main pain points: memory constraints and performance vs. accuracy trade-offs. The Raspberry Pi 5 plus the new AI HAT+ 2 (released late 2025) makes offline generative AI plausible for micro apps — but only if you compress models correctly, pick the right runtime, and design an inference pipeline that respects the Pi’s memory and thermal envelope. This guide walks you through model pruning, int8 quantization via ONNX, deployment patterns, and real-world trade-offs so you can ship an offline micro app that actually performs.

What’s changed in 2026: Why this matters now

Edge ML moved from demos to production between 2023–2026. Two trends matter for Raspberry Pi 5 in 2026:

  • Specialized HATs like the AI HAT+ 2 include vendor runtimes and ONNX Execution Providers that let you offload quantized workloads to hardware accelerators, drastically improving throughput and lowering CPU memory pressure.
  • Quantization toolchains matured: ONNX Runtime and community tools now support robust post-training static int8 and dynamic int8 workflows for many transformer blocks, making 4–8× memory reductions realistic while retaining usable quality for micro apps.

High-level approach

  1. Pick the right model family and size for the use case (prefer 1.3B–3B for most Pi deployments).
  2. Prune and/or distill to reduce parameter count if you can afford an extra offline training pass.
  3. Convert to ONNX with operator compatibility in mind.
  4. Quantize to int8 (static where possible) using ONNX Runtime quantization tools and a small calibration dataset.
  5. Deploy to Raspberry Pi 5, attach AI HAT+ 2 runtime, and use the vendor ONNX EP (execution provider) or CPU EP if EP isn’t available.
  6. Optimize runtime: memory mapping, thread and affinity tuning, streaming token outputs for micro apps.

Step 1 — Choose model and pruning strategy

Start with a model that matches your target latency and RAM budget. In 2026, a practical rule-of-thumb on Pi 5 (8GB/16GB variants) is:

  • 1B–2B: Best latency and memory fit for interactive micro apps (chatbots, home automation), even with large vocabularies.
  • 3B: Good balance — may require aggressive quantization and the HAT+ 2 offload for smooth streaming.
  • >7B: Difficult on Pi 5, even with HAT acceleration; expect heavy offloading and compromises.
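
This rule of thumb follows from simple weight-size arithmetic (weights only; the KV cache, activations, and the OS all need headroom on top). A quick sketch:

# Back-of-the-envelope weight footprint: parameters × bytes per parameter
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

for n in (1.3, 3.0, 7.0):
    print(f"{n}B params: fp16 ≈ {weight_gb(n, 2):.1f} GB, int8 ≈ {weight_gb(n, 1):.1f} GB")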

If you control training, apply structured or unstructured pruning and optionally knowledge distillation to a smaller student model. A simple pipeline:

  1. Fine-tune teacher on your domain (optional).
  2. Prune weights using magnitude pruning for linear layers, followed by a short re-training (5–10 epochs at low LR).
  3. Distill logits to a 1–3B student model.

Pruning reduces size but can increase inference sparsity that many runtimes don’t exploit. Use pruning primarily to reduce training time and model size before quantization; rely on quantization for runtime memory savings.
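
A minimal sketch of step 2 using PyTorch's built-in torch.nn.utils.prune utilities (the 30% amount is illustrative; run the short re-training pass after pruning and bake the masks in before exporting to ONNX):

import torch
import torch.nn.utils.prune as prune

def magnitude_prune_linears(model: torch.nn.Module, amount: float = 0.3):
    """Apply L1-magnitude (unstructured) pruning to every Linear layer's weights."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

def make_pruning_permanent(model: torch.nn.Module):
    """Bake the pruning masks into the weights so the model exports cleanly."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.remove(module, "weight")
    return model

# Usage: prune, re-train briefly at low LR, then make the masks permanent.
# model = magnitude_prune_linears(model, amount=0.3)
# ... short fine-tuning pass ...
# model = make_pruning_permanent(model)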

Step 2 — Export to ONNX

ONNX is the neutral format you’ll use to run models on the HAT+ 2 runtime or ONNX Runtime with custom execution providers. The basic export pipeline from a PyTorch Hugging Face model looks like this (replace the model name with your own checkpoint):

from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.onnx import FeaturesManager, export

model_name = "my-fine-tuned-3b"
output_dir = Path("./onnx-model")
output_dir.mkdir(exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the HF export helper for causal LM: resolve the ONNX config, then export.
# For newer architectures, the maintained path is the optimum exporter (optimum-cli export onnx).
_, onnx_config_ctor = FeaturesManager.check_supported_model_or_raise(model, feature="causal-lm")
onnx_config = onnx_config_ctor(model.config)
export(tokenizer, model, onnx_config, opset=18, output=output_dir / "model.onnx")

Notes:

  • Use an opset >= 13 for better transformer op support; many vendor EPs recommend opset 18 as of late 2025/2026.
  • Test the ONNX graph locally with onnx.checker and a small tokenized input to detect unsupported ops early.
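
A quick sanity check along those lines, assuming the exported graph declares input_ids (and optionally attention_mask):

import numpy as np
import onnx
import onnxruntime as ort

# Structural check of the exported graph (passing the path also handles >2GB models)
onnx.checker.check_model("onnx-model/model.onnx")

# Smoke test with a tiny tokenized input on the CPU provider
sess = ort.InferenceSession("onnx-model/model.onnx", providers=["CPUExecutionProvider"])
dummy = {
    "input_ids": np.array([[1, 2, 3, 4]], dtype=np.int64),
    "attention_mask": np.ones((1, 4), dtype=np.int64),
}
# Only feed the inputs the graph actually declares
feed = {i.name: dummy[i.name] for i in sess.get_inputs() if i.name in dummy}
print(sess.run(None, feed)[0].shape)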

Step 3 — Quantize to int8 with ONNX Runtime

ONNX Runtime provides two main quantization workflows: dynamic (weights quantized at runtime) and static (requires calibration data). Static int8 is the most memory-efficient and usually yields better accuracy, but needs a small calibration corpus (1–2k tokens is typical for domain-specific LLMs).

pip install onnxruntime onnxruntime-tools onnx

from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat

class TextCalibrationDataReader(CalibrationDataReader):
    """Feeds tokenized prompts to the quantizer one batch at a time."""
    def __init__(self, tokenized_inputs):
        self.data = tokenized_inputs
        self.iter = None

    def get_next(self):
        if self.iter is None:
            self.iter = iter(self.data)
        try:
            # Each entry is a dict of numpy arrays keyed by ONNX input name
            return dict(next(self.iter))
        except StopIteration:
            return None

model_fp = "model.onnx"
calib_inputs = [...]  # list of tokenized dicts (input_ids, attention_mask) as numpy arrays
reader = TextCalibrationDataReader(calib_inputs)
quantize_static(model_fp, "model-int8.onnx", reader, quant_format=QuantFormat.QOperator)

Key tips:

  • Use QuantFormat.QOperator for broad compatibility with vendor EPs; some EPs prefer the QDQ format, so test both (see the sketch after these tips).
  • Keep the calibration dataset representative of expected prompts to avoid large accuracy drops.
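
A minimal sketch for producing one artifact per format so you can benchmark whichever the vendor EP prefers. It reuses TextCalibrationDataReader and calib_inputs from the block above; note that a calibration reader is consumed as it iterates, so build a fresh one per run:

from onnxruntime.quantization import QuantFormat, quantize_static

quantize_static("model.onnx", "model-int8-qop.onnx",
                TextCalibrationDataReader(calib_inputs),
                quant_format=QuantFormat.QOperator)
quantize_static("model.onnx", "model-int8-qdq.onnx",
                TextCalibrationDataReader(calib_inputs),
                quant_format=QuantFormat.QDQ)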

Dynamic quantization (fast, no calibration)

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model.onnx", "model-int8-dyn.onnx", weight_type=QuantType.QInt8)

Dynamic quantization is simpler, but it computes activation scales at runtime, which usually costs some speed and accuracy compared to static int8. Still useful for rapid prototyping on the Pi.

Step 4 — Verify accuracy and latency trade-offs

Before deploying, run a small battery of quality and latency tests:

  • Perplexity or token-likelihood comparison between FP16 and int8 on a test set.
  • Latency (ms/token) at different batch sizes and with the AI HAT+ 2 attached.

Acceptable deltas vary by app. For command-and-control micro apps, a 2–4% quality drop is often fine if latency improves 2–5×.
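
A rough latency harness along these lines, assuming the exported graph's inputs are named input_ids and attention_mask; it averages repeated single-batch forward passes as a crude proxy for per-token decode cost:

import time
import numpy as np
import onnxruntime as ort

def avg_forward_ms(model_path, seq_len=64, steps=32, providers=("CPUExecutionProvider",)):
    """Crude ms/token proxy: average wall time of repeated single-batch forward passes."""
    sess = ort.InferenceSession(model_path, providers=list(providers))
    ids = np.random.randint(0, 1000, size=(1, seq_len), dtype=np.int64)
    feed = {"input_ids": ids}
    if any(i.name == "attention_mask" for i in sess.get_inputs()):
        feed["attention_mask"] = np.ones_like(ids)
    sess.run(None, feed)  # warm-up pass
    start = time.perf_counter()
    for _ in range(steps):
        sess.run(None, feed)
    return (time.perf_counter() - start) * 1000 / steps

print(f"fp baseline: {avg_forward_ms('model.onnx'):.1f} ms")
print(f"int8:        {avg_forward_ms('model-int8.onnx'):.1f} ms")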

Step 5 — Prepare Raspberry Pi 5 and AI HAT+ 2

Setup steps (assuming Raspberry Pi OS 64-bit or a compatible Debian 12/13 build):

  1. Update system and install prerequisites: Python 3.11, pip, build-essential.
  2. Install ONNX Runtime wheel that matches Pi’s architecture (aarch64) and the AI HAT+ 2 vendor runtime. Vendors usually publish an ONNX Execution Provider (EP) or a custom runtime package — install that to enable offload.
  3. Enable swapfile (cautiously) and tune swappiness for occasional out-of-core activations; but prefer memory-mapping and offload to HAT+ 2 whenever possible.

Example commands:

sudo apt update && sudo apt upgrade -y
sudo apt install python3-venv python3-pip build-essential -y
python3 -m venv venv && source venv/bin/activate
pip install onnxruntime onnx numpy  # PyPI publishes aarch64 wheels for onnxruntime
# Install vendor HAT+2 runtime (follow vendor guide) - often a .deb or pip package

Step 6 — Inference pipeline (production-ready pattern)

Design your micro-app to minimize memory spikes and maximize streaming responsiveness. A recommended pipeline:

  1. Tokenize input locally using a lightweight tokenizer (load once at process start).
  2. Use a memory-mapped ONNX model file or load quantized ONNX into the HAT+ 2 EP to avoid copying large tensors into RAM.
  3. Run generation loop with small per-step decoding (top-p/top-k) and stream tokens back over a lightweight HTTP/gRPC endpoint.

Python example (simplified):

import numpy as np
import onnxruntime as ort
from tokenizer import Tokenizer  # placeholder: any lightweight local tokenizer wrapper

sess_opts = ort.SessionOptions()
# Add the vendor EP first when it is available, e.g.
# providers=['HAT2EP', 'CPUExecutionProvider']  ('HAT2EP' is a placeholder name)
sess = ort.InferenceSession('model-int8.onnx', sess_options=sess_opts,
                            providers=['CPUExecutionProvider'])

tokenizer = Tokenizer('vocab.json')

def stream_generate(prompt, max_tokens=64):
    input_ids = tokenizer.encode(prompt)  # list of token ids
    for _ in range(max_tokens):
        ids = np.array([input_ids], dtype=np.int64)
        # Drop attention_mask if your exported graph does not declare it
        ort_inputs = {"input_ids": ids, "attention_mask": np.ones_like(ids)}
        logits = sess.run(None, ort_inputs)[0]      # shape (1, seq_len, vocab)
        next_token = int(np.argmax(logits[0, -1]))  # greedy; swap in top-p/top-k as needed
        yield tokenizer.decode([next_token])
        input_ids.append(next_token)

Important: Use small generation steps and avoid concatenating huge histories in memory. For longer contexts, implement a sliding-window cache of key/value tensors if the EP supports KV-caching.
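
Even without EP-level KV caching, plain token windowing keeps memory bounded. A minimal sketch (the max_context and keep_prefix values are illustrative):

MAX_CONTEXT = 256   # illustrative budget; tune to your model and RAM headroom

def trim_context(input_ids, max_context=MAX_CONTEXT, keep_prefix=8):
    """Sliding window: keep a short prefix (e.g. the system prompt) plus the most
    recent tokens, dropping the middle once the window overflows."""
    if len(input_ids) <= max_context:
        return input_ids
    return input_ids[:keep_prefix] + input_ids[-(max_context - keep_prefix):]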

Memory engineering: tricks that matter

  • Memory-map ONNX model file: Reading parameters directly off disk via mmap reduces peak RSS compared to loading all tensors into Python memory.
  • Use int8 weights — they cut model weight size ~4× relative to fp32 (and ~2× vs. fp16) which usually fits the Pi 5 memory with headroom for token buffers.
  • KV caching and token windowing — avoid storing full token histories; evict older tokens or compress cached keys with lower precision if acceptable.
  • Swap as last resort — configure a zram swap and low swappiness; plain swap-on-disk will kill latency.
  • Thread and affinity tuning — pin threads to physical cores, lower Python GIL contention with worker processes if multi-session serving is required.
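
A minimal sketch of that thread and affinity tuning for the Pi 5's four cores (core ids and thread counts are illustrative, and sched_setaffinity is Linux-only):

import os
import onnxruntime as ort

# Pin this process to the four Cortex-A76 cores
os.sched_setaffinity(0, {0, 1, 2, 3})

sess_opts = ort.SessionOptions()
sess_opts.intra_op_num_threads = 4   # one compute thread per physical core
sess_opts.inter_op_num_threads = 1   # single stream for a single-session micro app
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("model-int8.onnx", sess_options=sess_opts,
                            providers=["CPUExecutionProvider"])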

Deployment patterns for micro apps

1. Local personal assistant (single device)

  • Run as systemd service with a small HTTP API that streams tokens via Server-Sent Events (SSE).
  • Use certificate-based local auth; no cloud connectivity ensures privacy.
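
A minimal sketch of that SSE endpoint, reusing the stream_generate() generator from Step 6; FastAPI and uvicorn are assumptions here, not requirements:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/generate")
def generate(prompt: str, max_tokens: int = 64):
    def event_stream():
        for token in stream_generate(prompt, max_tokens=max_tokens):
            yield f"data: {token}\n\n"   # SSE framing: one event per token
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")

# Run with: uvicorn app:app --host 127.0.0.1 --port 8080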

2. Edge microservice (LAN or small office)

  • Containerize with an aarch64 base image (Debian slim). Use multi-stage builds to avoid shipping build tools in production images.
  • Expose gRPC for low-latency calls; implement rate limits and request queueing to avoid overcommit.

3. Fleet deployment (many Pis)

  • Use a lightweight orchestrator (Balena, K3s) and a CI pipeline that produces optimized ONNX artifacts per Pi hardware variant.
  • Rolling update strategy: push quantized models incrementally; include graceful fallback to CPU FP16 if HAT+ 2 fails or disconnects.
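
A minimal sketch of that fallback at session-creation time; 'HAT2ExecutionProvider' is a placeholder for whatever name the vendor runtime registers, and in practice you might also point the fallback at an fp16 artifact:

import onnxruntime as ort

def open_session(model_path: str) -> ort.InferenceSession:
    """Prefer the vendor EP when it is registered; otherwise start on CPU so the service stays up."""
    available = set(ort.get_available_providers())
    preferred = ["HAT2ExecutionProvider", "CPUExecutionProvider"]  # placeholder EP name
    providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)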

Troubleshooting checklist

  • Model won’t load: check opset version and unsupported ops; use onnxruntime-tools to graph-check.
  • Quality drop too high after int8: expand calibration dataset, try QDQ vs. QOperator formats, or run mixed precision (fp16 for attention, int8 for feed-forward).
  • High swap usage: reduce batch size, enable zram with a conservative limit, or lower vm.swappiness carefully.
  • HAT+ 2 not used: verify EP registration (ort.get_available_providers()) and vendor runtime logs.

Performance numbers and expectations (realistic)

Performance depends on model size and whether the HAT+ 2 EP is used. Typical ballpark (2026 baseline):

  • 1.3B model int8 on CPU-only Pi 5: ~30–60 ms/token.
  • 1.3B model int8 with AI HAT+ 2 EP: ~8–20 ms/token (varies by EP maturity and thermal throttling).
  • 3B model int8 with HAT+ 2: ~15–40 ms/token.

Measure with a reproducible harness and test representative prompts. Keep in mind that long-running sessions may throttle thermally — design with cooling and request pacing.

2026 advanced strategies and future-proofing

As of early 2026, these advanced techniques are becoming mainstream:

  • Hybrid quantization: Float16 for attention + int8 for feed-forward layers to balance fidelity and size (sketched after this list).
  • Operator fusion at export-time: Fuse layernorm and linear ops to reduce kernel overhead on small devices.
  • Model patching: Live-patch smaller modules (domain adapters) without replacing the full model artifact for fast iteration on user-specific needs.
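
A sketch of the hybrid-quantization idea using quantize_static's nodes_to_exclude parameter, reusing the calibration reader from Step 3; matching attention nodes by an "attn" substring is an assumption, so inspect your exported graph (e.g. in Netron) for the real node names:

import onnx
from onnxruntime.quantization import QuantFormat, quantize_static

# Assumption: attention nodes carry "attn" in their names; verify against your graph
attention_nodes = [node.name for node in onnx.load("model.onnx").graph.node
                   if "attn" in node.name.lower()]

quantize_static(
    "model.onnx", "model-int8-hybrid.onnx",
    TextCalibrationDataReader(calib_inputs),
    quant_format=QuantFormat.QOperator,
    nodes_to_exclude=attention_nodes,  # keep attention in higher precision
)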

Plan your CI to produce both QOperator and QDQ artifacts, and include per-release microbenchmarks in pipelines so you can track regressions.

Security and privacy considerations

  • Keep model artifacts local and sign them for integrity checking — on-device inference is an advantage for sensitive data.
  • Rate-limit generation to prevent resource exhaustion attacks on shared devices.
  • Log only high-level metrics; avoid storing raw prompts persistently unless required and encrypted.

Actionable checklist (get started in one afternoon)

  1. Pick a 1.3B or 3B HF-compatible model and export to ONNX (opset 18).
  2. Create a 1–2k token calibration set sampled from expected prompts.
  3. Quantize static to int8 with ONNX Runtime and test quality vs. fp16.
  4. Install ONNX Runtime on your Pi and the vendor EP for AI HAT+ 2; run a latency benchmark script.
  5. Wrap inference in a small streaming HTTP API and deploy as a systemd service or container.

Case study: Personal recipe assistant (illustrative)

We built a privacy-first recipe assistant that runs on a Pi 5 + AI HAT+ 2. Key decisions:

  • Model: distilled 1.3B causal LM, static int8 quantized.
  • Pipeline: Token-level streaming with a 256-token sliding context for stateful conversations.
  • Deployment: Container with ONNX + HAT EP, served over a local HTTPS endpoint and authenticated using local OAuth tokens.

Outcome: Average 12 ms/token, 2–3% drop in recipe fidelity vs. fp16 baselines — acceptable for a single-user micro app.

Closing: Key takeaways

  • Quantization (int8) + ONNX is the practical path to make LLM inference feasible on Raspberry Pi 5.
  • Use static quantization with a representative calibration set for the best size/quality trade-off.
  • Pair with AI HAT+ 2 vendor runtime to offload compute and reduce CPU memory pressure.
  • Design inference pipelines for streaming small token steps, memory-mapped models, and KV cache management.

By combining pruning, static int8 quantization, and a hardware-aware runtime, you can run credible LLM-powered micro apps on a Raspberry Pi 5 — offline, private, and cost-effective.

Call to action

Ready to get hands-on? Clone our sample repo (includes ONNX export scripts, static quantization examples, and a streaming HTTP microservice) and try it on a Raspberry Pi 5 with AI HAT+ 2. Share your benchmark results or questions — we’ll iterate the deployment patterns and post a follow-up focused on container orchestration and fleet rollouts in 2026.

Related Topics

#edge-ml #llm #raspberry-pi

webdecodes

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
