Offline-First Fleet Telemetry: Raspberry Pi 5 Gateways with Local Routing and Delayed Sync
Architectural guide to offline-first fleet telemetry with Raspberry Pi 5 + AI HAT+ 2, local decisioning, and delayed ClickHouse sync.
Stop losing telemetry when networks fail: build an offline-first fleet gateway
If you operate a fleet of vehicles, drones, or remote sensors, you know the pain: bursty cellular coverage, unpredictable satellite windows, and high egress bills all conspire to make real-time telemetry unreliable. This guide shows a battle-tested architecture for resilient, offline-first fleet telemetry: Raspberry Pi 5 gateways with the AI HAT+ 2 for local decisioning, plus a robust delayed-sync strategy into central analytics (ClickHouse). You’ll leave with concrete deployment recipes, CI/CD artifacts, and operational patterns to run a fleet that keeps navigating and reporting even when the cloud is unreachable.
Top-level architecture (most important first)
Design goals: survive network outages, minimize bandwidth, perform local navigation/decisioning, and make central analytics reliable and idempotent when sync resumes. At a glance:
- Raspberry Pi 5 + AI HAT+ 2 as the local gateway and inference node.
- Local services: lightweight message broker (MQTT), buffering store (SQLite/LevelDB), local router/policy engine, and sync agent.
- Edge decisioning: model inference on AI HAT+ 2 for navigation or anomaly detection, with a model registry and versioning.
- Delayed sync: batch, compress, sign, and push to central ClickHouse during windows of good connectivity.
- CI/CD + OTA: build multi-arch container images and push updates using GitHub Actions + balena/Mender or Fleet/Ansible for staged rollouts.
ASCII diagram (quick map)
[Sensors / Vehicle CAN] --> [Pi 5 + AI HAT+ 2]
      |--> Local Broker (MQTT) --> Buffer Store (journal)
      |--> Decision Engine (edge ML)
      |--> Local Router (policy/rules)
      '--> Sync Agent --(cell/wifi)--> Central Ingest (ClickHouse)
Why this matters in 2026
Edge AI hardware like the AI HAT+ 2 has made on-device inference for navigation and anomaly detection practical on Raspberry Pi-class devices in 2025–2026. At the same time, ClickHouse continues to accelerate as a high-throughput OLAP backend (following major funding and broad enterprise adoption in late 2025), making it a realistic target for fleet analytics. Combining local decisioning with a robust delayed-sync pipeline reduces cost and improves safety: local decisions keep vehicles operating, while batched analytics keep your ML ops and long-term analytics intact.
Core components and responsibilities
1. Local gateway (Raspberry Pi 5 + AI HAT+ 2)
Responsibilities: collect telemetry, run inference, apply routing/policy, persist a local journal, and drive sync. Practical tips:
- Use 64-bit Raspberry Pi OS or a minimal Debian arm64 base. Pi 5 performance makes multi-threaded ingestion and inference feasible.
- Run quantized models (ONNX/TFLite) on the AI HAT+ 2 for local decisioning; avoid large transformer models unless heavily optimized.
- Run critical services as containers or systemd units for easier lifecycle management.
2. Local messaging & buffer
Use an embedded MQTT broker (e.g., mosquitto) or lightweight NATS for pub/sub between telemetry producers and consumers. Persist messages into a write-ahead local store (SQLite for small volumes, RocksDB/LevelDB for higher throughput).
# Minimal SQLite schema for telemetry journaling
CREATE TABLE telemetry_journal (
    seq           BIGINT PRIMARY KEY,
    received_ts   DATETIME DEFAULT CURRENT_TIMESTAMP,
    device_id     TEXT,
    payload       BLOB,
    model_version TEXT,
    processed     BOOLEAN DEFAULT 0
);
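A minimal journaling consumer is sketched below, assuming a local mosquitto broker and the paho-mqtt client; the topic layout, database path, and per-gateway seq counter are illustrative assumptions rather than fixed parts of the architecture.
# Sketch: persist MQTT telemetry into the SQLite journal (assumes paho-mqtt is installed)
import json
import sqlite3
import paho.mqtt.client as mqtt

conn = sqlite3.connect("/var/lib/gateway/journal.db", check_same_thread=False)

def next_seq():
    # Monotonic sequence that survives restarts because it is derived from the journal itself.
    return conn.execute("SELECT COALESCE(MAX(seq), 0) + 1 FROM telemetry_journal").fetchone()[0]

def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    conn.execute(
        "INSERT INTO telemetry_journal (seq, device_id, payload) VALUES (?, ?, ?)",
        (next_seq(), event.get("device_id", "unknown"), msg.payload),
    )
    conn.commit()

client = mqtt.Client()           # paho-mqtt 1.x style constructor; 2.x also takes a CallbackAPIVersion argument
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("telemetry/#")  # assumption: producers publish under telemetry/<device_id>
client.loop_forever()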
3. Edge decisioning
Decisioning examples: navigation reroute, obstacle classification, sensor fusion. Patterns:
- Keep the model inference deterministic and versioned. Store model metadata in the gateway and use a periodic model-checker to fetch updates. See notes on model registry and versioning.
- Prefer local thresholding + lightweight classifiers for safety-critical decisions; only send summaries upstream to save bandwidth.
- Log inference metadata (model version, confidence, inputs hash) to the telemetry journal for later evaluation in ClickHouse.
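As an illustration of the last point, here is a sketch that appends inference metadata to the same journal; the function name and fields are assumptions, not a fixed schema.
# Sketch: journal inference metadata for later evaluation in ClickHouse
import hashlib
import json

def journal_inference(conn, device_id, inputs: bytes, model_version, label, confidence):
    # Store the inputs hash (not the raw inputs), the model version, and the confidence.
    meta = json.dumps({
        "event": "inference",
        "label": label,
        "confidence": round(confidence, 4),
        "inputs_sha256": hashlib.sha256(inputs).hexdigest(),
    })
    seq = conn.execute("SELECT COALESCE(MAX(seq), 0) + 1 FROM telemetry_journal").fetchone()[0]
    conn.execute(
        "INSERT INTO telemetry_journal (seq, device_id, payload, model_version) VALUES (?, ?, ?, ?)",
        (seq, device_id, meta, model_version),
    )
    conn.commit()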
4. Local routing & policy engine
Routing here is logical: decide which messages are critical (must sync immediately when possible), which are low-priority (batch), and which are ephemeral (discard after X). Use policy rules and service-level priorities (a minimal classifier sketch follows this list):
- Priority 1: safety events (send ASAP, keep retrying until acked)
- Priority 2: navigation snapshots (batched with dedup)
- Priority 3: telemetry metrics (aggregate and compress)
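A minimal classifier mapping events to these tiers; the event types and policy fields are placeholders to adapt to your fleet.
# Sketch: map an event to a sync priority and handling policy
def classify(event: dict) -> dict:
    etype = event.get("type")
    if etype in ("collision_warning", "geofence_breach", "emergency_stop"):
        return {"priority": 1, "retry": "until_acked", "batch": False}
    if etype in ("nav_snapshot", "route_update"):
        return {"priority": 2, "batch": True, "dedup": True}
    if etype in ("gps_ping", "engine_metric"):
        return {"priority": 3, "batch": True, "aggregate": True}
    # Anything unclassified is treated as ephemeral and expires locally.
    return {"priority": 9, "ttl_seconds": 3600, "batch": False}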
5. Sync agent
The sync agent coordinates upload windows, controls bandwidth, compresses batches, and guarantees idempotent inserts into ClickHouse. Key features:
- Checkpointing via sequence numbers and persisted watermarks.
- Adaptive backoff and bandwidth shaping (e.g., a token-bucket, sketched below) to avoid saturating metered links.
- Multi-format output: JSONEachRow for small batches, compressed CSV or Parquet for bulk loads.
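A token-bucket shaper in the spirit of the second point; the rates are placeholders, and a production agent would also cap concurrent uploads.
# Sketch: byte-rate limiter for metered links
import time

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate, self.capacity = rate_bps, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def consume(self, nbytes: int) -> None:
        # Block until enough budget has accumulated to send nbytes.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Example: cap uploads at ~64 KiB/s with a 256 KiB burst.
bucket = TokenBucket(rate_bps=64 * 1024, burst_bytes=256 * 1024)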
Data model & ClickHouse integration
Design ClickHouse tables for append-only telemetry with idempotency and efficient queries.
CREATE TABLE telemetry_raw (
    device_id     String,
    event_ts      DateTime64(3),
    seq           UInt64,
    payload       String,
    model_version String,
    processed     UInt8
) ENGINE = ReplacingMergeTree(seq)
PARTITION BY toYYYYMM(event_ts)
ORDER BY (device_id, seq)
SETTINGS index_granularity = 8192;
Why ReplacingMergeTree? It lets you insert duplicates during resync and rely on the seq version column to collapse them during merges (or switch to CollapsingMergeTree with a sign column if you need explicit retractions). Use TTL to expire old raw payloads and aggregate into downsampled tables for cost control. For a broader take on storage and cost tradeoffs see A CTO’s Guide to Storage Costs.
Best practices for ingestion
- Use HTTP POST to ClickHouse with FORMAT JSONEachRow for simplicity. For high throughput use native or binary formats and gzip compression.
- Batch size: 1–10k rows or up to 5–10 MB per request; compress with gzip.
- Always include device_id + seq for idempotency. Use ReplacingMergeTree or CollapsingMergeTree for dedup semantics.
# Example curl push (basic); batch.json.gz is the gzip-compressed JSONEachRow batch
curl -sS --data-binary @batch.json.gz \
  "https://clickhouse.example.com/?query=INSERT%20INTO%20telemetry_raw%20FORMAT%20JSONEachRow" \
  -H "Content-Encoding: gzip" \
  --cert device.crt --key device.key
Offline-first sync patterns
Checkpointing and sequence numbers
Keep a monotonically increasing seq per device. Store a durable checkpoint (last-acked seq). When connectivity returns, upload segments from checkpoint+1 to head. This ensures reliable ordered replay.
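A durable watermark can live alongside the journal; below is a minimal sketch, assuming the same SQLite connection (the checkpoint table name is an assumption).
# Sketch: persisted checkpoint (last-acked seq)
def load_checkpoint(conn) -> int:
    conn.execute("CREATE TABLE IF NOT EXISTS sync_checkpoint "
                 "(id INTEGER PRIMARY KEY CHECK (id = 1), last_acked_seq INTEGER NOT NULL)")
    row = conn.execute("SELECT last_acked_seq FROM sync_checkpoint WHERE id = 1").fetchone()
    return row[0] if row else 0

def save_checkpoint(conn, seq: int) -> None:
    # Only advance the watermark after ClickHouse has acknowledged the batch.
    conn.execute("INSERT INTO sync_checkpoint (id, last_acked_seq) VALUES (1, ?) "
                 "ON CONFLICT(id) DO UPDATE SET last_acked_seq = excluded.last_acked_seq", (seq,))
    conn.commit()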
Batching, compression, and deduplication
- Aggregate similar telemetry (e.g., GPS pings) by sampling or delta-encoding to shrink payloads (a small encoder sketch follows this list).
- Compress batches with gzip or zstd—zstd is increasingly standard in 2026 for its speed and compression ratio.
- Include hash or UUID per event to detect duplicates when ClickHouse dedup semantics are insufficient.
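A delta-encoding sketch for GPS pings; the field names and the 1e-6 degree scale are assumptions, and the decoder downstream must mirror them.
# Sketch: first absolute fix plus integer deltas for subsequent pings
def delta_encode_gps(pings: list[dict]) -> list[dict]:
    if not pings:
        return []
    encoded, prev = [pings[0]], pings[0]
    for p in pings[1:]:
        encoded.append({
            "dt": p["ts"] - prev["ts"],
            "dlat": round((p["lat"] - prev["lat"]) * 1e6),
            "dlon": round((p["lon"] - prev["lon"]) * 1e6),
        })
        prev = p
    return encoded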
Adaptive sync windows
Implement sync policies that respect cost and latency profiles. Typical strategies:
- Immediate sync for safety-critical events using small reliable messages and persistent retries.
- Scheduled bulk sync during known-good windows (e.g., when connected to HQ wifi or during low-cost satellite windows).
- Network-aware policy—use signal strength, measured latency, and available bandwidth to control batch size. For practical connectivity planning, consult The Road-Trip Phone Plan.
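A simple network-aware batch-size heuristic, assuming the agent already measures throughput and RTT; the thresholds are placeholders to tune per fleet.
# Sketch: scale batch size down on slow or metered links
def pick_batch_size(throughput_kbps: float, rtt_ms: float, metered: bool) -> int:
    if metered:
        return 500            # minimise spend: small, infrequent batches
    if throughput_kbps < 100 or rtt_ms > 800:
        return 1_000          # poor link: keep each request short-lived
    if throughput_kbps < 1_000:
        return 5_000
    return 10_000             # good link: fill the 5-10 MB per-request budget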
Conflict handling and idempotency
Because gateways may re-send events after reboots or reconnections, design for idempotent ingestion:
- Use (device_id, seq) as a natural dedup key. Implement ReplacingMergeTree keyed on seq so duplicates collapse.
- For state updates, write events as append-only and derive state with materialized views or aggregation jobs.
- Store event metadata (ingest_ts, source_gateway_id) for auditability.
Security and data integrity
Secure both transport and local storage:
- mTLS for uploads to ClickHouse; device certificates provisioned at build/deploy time.
- Encrypt local persistent store at rest (LUKS or application-layer AES) for sensitive data.
- Signed batches: sign compressed payloads with the device private key so tampering can be detected (a signing sketch follows this list).
- Limit inbound network services on the Pi with nftables/iptables; expose only necessary ports to local networks.
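One way to produce signed batches, shown here with an Ed25519 key via the cryptography package; the key path and the detached-signature convention are assumptions.
# Sketch: compress a batch and sign the exact bytes that will be uploaded
import gzip
from cryptography.hazmat.primitives import serialization

def sign_batch(raw_rows: bytes, key_path: str = "/etc/gateway/device_ed25519.pem"):
    with open(key_path, "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)
    payload = gzip.compress(raw_rows)
    return payload, key.sign(payload)   # Ed25519 detached signature, verified server-side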
CI/CD and OTA for models and software (practical walkthrough)
Goal: reproducible, multi-arch builds for Pi 5 (arm64), automated testing, and staged rollouts.
Dockerfile (multi-arch) example
FROM --platform=$BUILDPLATFORM python:3.11-slim AS builder
WORKDIR /src
COPY requirements.txt ./
# Note: packages with native extensions need arm64 wheels; build under emulation or use prebuilt wheels.
RUN pip wheel -r requirements.txt -w /wheels

FROM --platform=linux/arm64 python:3.11-slim
COPY --from=builder /wheels /wheels
COPY --from=builder /src/requirements.txt /tmp/requirements.txt
RUN pip install --no-index --find-links=/wheels -r /tmp/requirements.txt
WORKDIR /app
COPY . /app
CMD ["python", "-u", "gateway_service.py"]
GitHub Actions (build & push) snippet
name: Build & Push
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          platforms: linux/arm64,linux/amd64
          push: true
          tags: ghcr.io/org/gateway:latest
Deployment and staged rollout
Use balenaCloud, Mender, or your fleet manager to orchestrate rollouts. Staging pattern:
- Push image to registry via CI.
- Canary deploy to 1–2 gateways (smoke tests, telemetry checks).
- If green, promote to 10% group, run regression, then full rollout.
Automate health checks: boot time, service heartbeats, inference output sanity checks. Roll back automatically on failure. For real-world micro-deployment patterns and tooling, see Micro Apps Case Studies.
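A minimal health probe the rollout pipeline can gate on; the thresholds and paths are illustrative.
# Sketch: local health check for canary gating and automatic rollback
import shutil
import sqlite3

def health(journal_path="/var/lib/gateway/journal.db", max_backlog=50_000, min_free_gb=2.0):
    reasons = []
    conn = sqlite3.connect(journal_path)
    backlog = conn.execute("SELECT COUNT(*) FROM telemetry_journal WHERE processed = 0").fetchone()[0]
    if backlog > max_backlog:
        reasons.append(f"sync backlog too large: {backlog}")
    free_gb = shutil.disk_usage("/var/lib/gateway").free / 1e9
    if free_gb < min_free_gb:
        reasons.append(f"low disk: {free_gb:.1f} GB free")
    return (not reasons), reasons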
Operational considerations: monitoring, costs, and scaling
Monitor both local health and sync health:
- Local metrics: queue depth, disk usage, CPU/temperature (AI workloads raise temp), inference latency and confidence distributions.
- Sync metrics: success rate, bytes transferred per window, retry counts, lag (seq gap).
- Central metrics in ClickHouse: ingestion throughput, dedup counts, late-arriving events.
Bandwidth and cost controls:
- Smart sampling: reduce GPS frequency when stationary (see the sampling sketch after this list).
- Edge aggregation: send summaries (min/median/max) instead of raw streams.
- Use satellite/cellular plan awareness: avoid heavy uploads on metered links.
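A stationary-aware sampler and a window summariser in the spirit of the first two points; the thresholds are placeholders.
# Sketch: sample less when stationary, aggregate before upload
import statistics

def gps_interval_seconds(speed_m_s: float, moving_threshold=0.5, base=1, idle=60) -> int:
    # Sample every second while moving, once a minute when effectively stationary.
    return idle if speed_m_s < moving_threshold else base

def summarise_window(values: list[float]) -> dict:
    # Edge aggregation: upload min/median/max for a window instead of the raw stream.
    return {"min": min(values), "median": statistics.median(values), "max": max(values)}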
Advanced patterns and future-proofing
- Hybrid compute: run heavier model retraining in the cloud, but push distilled models to AI HAT+ 2 for inference. Keep model registry in Git/Git LFS or an OCI artifact registry.
- Federated learning light: send aggregated gradients or model statistics (differentially private) rather than raw telemetry if privacy requires it.
- Edge-to-edge routing: allow nearby gateways to peer for local sync when central connectivity is down—use a mesh (WireGuard + mDNS) and elect a sync leader.
Sample operational checklist
- Provision device cert + initial model during manufacturing or first boot.
- Start local broker and journal; ensure journal storage has watermarking.
- Enable inference shim that writes model metadata alongside telemetry.
- Configure sync agent with bandwidth policy and ClickHouse endpoint (mTLS credentials).
- Deploy health checks and canary CI/CD workflows.
Actionable code snippets
Minimal sync logic (Python):
import gzip, json, sqlite3
import requests

def read_batch(conn, last_seq, limit=1000):
    # Read the next journal segment in seq order, starting after the checkpoint.
    cur = conn.cursor()
    cur.execute(
        "SELECT seq, device_id, payload, model_version FROM telemetry_journal "
        "WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seq, limit),
    )
    return cur.fetchall()

def make_payload(rows):
    # Shape rows as JSONEachRow-compatible dicts.
    for seq, device_id, payload, model_version in rows:
        yield {"device_id": device_id, "seq": seq,
               "payload": payload.decode() if isinstance(payload, bytes) else payload,
               "model_version": model_version}

def push_to_clickhouse(batch, endpoint, cert, key):
    # Gzip the newline-delimited JSON and POST with the device mTLS credentials.
    data = "\n".join(json.dumps(r) for r in batch).encode("utf-8")
    resp = requests.post(endpoint, data=gzip.compress(data),
                         headers={"Content-Encoding": "gzip"}, cert=(cert, key))
    resp.raise_for_status()
    return True
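Tying it together with the checkpoint helpers sketched earlier; the endpoint and certificate paths are placeholders.
# Sketch: one sync pass from checkpoint to head
def sync_once(conn, endpoint, cert, key) -> int:
    last_seq = load_checkpoint(conn)
    rows = read_batch(conn, last_seq)
    if not rows:
        return 0
    if push_to_clickhouse(list(make_payload(rows)), endpoint, cert, key):
        save_checkpoint(conn, rows[-1][0])   # highest seq in this batch
    return len(rows)

# Example invocation during a good connectivity window:
# sync_once(sqlite3.connect("/var/lib/gateway/journal.db"),
#           "https://clickhouse.example.com/?query=INSERT%20INTO%20telemetry_raw%20FORMAT%20JSONEachRow",
#           "/etc/gateway/device.crt", "/etc/gateway/device.key")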
2026 trends and predictions (short)
- Edge AI hardware like AI HAT+ 2 becomes standard in fleet gateways — expect consistent reductions in inference TCO by 2027.
- ClickHouse adoption continues to grow in telemetry analytics; design patterns for offline-first ingestion become first-class in analytics stacks.
- Bandwidth-aware sync and privacy-preserving aggregation will be required by enterprises by 2026–2027 as regulatory and cost pressures increase.
Key takeaways (actionable)
- Design for ordered, idempotent ingestion (device_id + seq) so delayed syncs never corrupt analytics.
- Use local decisioning on AI HAT+ 2 to keep navigation safe and reduce cloud egress costs.
- Batch, compress, and checkpoint aggressively—use zstd/gzip and adaptive windows.
- Automate CI/CD and staged rollouts with multi-arch builds and canary groups to avoid mass failures.
- Monitor both device and sync metrics and set hard limits for disk/queue size to prevent uncontrolled backlog growth.
Resilience is not just redundancy — it’s local intelligence, clear ordering, and predictable sync semantics.
Next steps / Call to action
Ready to build this architecture? Download our reference repo (includes Dockerfiles, a sync agent, ClickHouse schema, and GitHub Actions templates) and run the provided Pi image in test mode. If you need help designing a production rollout or tuning ClickHouse for high-cardinality fleet data, contact our engineering team for a workshop or scheduled audit.
Deploy smarter: start with the repo, run a 3-node Pi5 + AI HAT+ 2 test cluster, and simulate intermittent connectivity to validate your delayed-sync policies—then graduate to staged field rollouts.
Related Reading
- Edge-First Patterns for 2026 Cloud Architectures: Integrating DERs, Low-Latency ML and Provenance
- Why On‑Device AI Is Now Essential for Secure Personal Data Forms (2026 Playbook)
- A CTO’s Guide to Storage Costs: Why Emerging Flash Tech Could Shrink Your Cloud Bill
- Field Guide: Hybrid Edge Workflows for Productivity Tools in 2026