Offline-First Fleet Telemetry: Raspberry Pi 5 Gateways with Local Routing and Delayed Sync

webdecodes
2026-02-13
10 min read

Architectural guide to offline-first fleet telemetry with Raspberry Pi 5 + AI HAT+ 2, local decisioning, and delayed ClickHouse sync.

Stop losing telemetry when networks fail: build an offline-first fleet gateway

If you operate a fleet of vehicles, drones, or remote sensors, you know the pain: bursty cellular coverage, unpredictable satellite windows, and high egress bills all conspire to make real-time telemetry unreliable. This guide shows a battle-tested architecture for resilient, offline-first fleet telemetry using Raspberry Pi 5 gateways with the new AI HAT+ 2 for local decisioning, plus a robust delayed-sync strategy to central analytics (ClickHouse). You’ll leave with concrete deployment recipes, CI/CD artifacts, and operational patterns to run a fleet that keeps navigating and reporting even when the cloud is unreachable.

Top-level architecture

Design goals: survive network outages, minimize bandwidth, perform local navigation/decisioning, and make central analytics reliable and idempotent when sync resumes. At a glance:

  • Raspberry Pi 5 + AI HAT+ 2 as the local gateway and inference node.
  • Local services: lightweight message broker (MQTT), buffering store (SQLite/LevelDB), local router/policy engine, and sync agent.
  • Edge decisioning: model inference on AI HAT+ 2 for navigation or anomaly detection, with a model registry and versioning.
  • Delayed sync: batch, compress, sign, and push to central ClickHouse during windows of good connectivity.
  • CI/CD + OTA: build multi-arch container images and push updates using GitHub Actions + balena/Mender or Fleet/Ansible for staged rollouts.

ASCII diagram (quick map)


  [Sensors/Vehicle CAN] --> Pi5+AIHAT2 --> Local Broker --> Buffer Store
                                                |--> Decision Engine (edge ML)
                                                |--> Local Router (policy/rules)
                                                `--> Sync Agent --(cell/wifi)--> Central Ingest (ClickHouse)
  

Why this matters in 2026

Edge AI hardware like the AI HAT+ 2 has made on-device inference for navigation and anomaly detection practical on Raspberry Pi-class devices in 2025–2026. At the same time, ClickHouse continues to accelerate as a high-throughput OLAP backend (with major funding and broad enterprise adoption in late 2025), making it a realistic target for fleet analytics. Combining local decisioning with a robust delayed-sync pipeline reduces cost and improves safety: local decisions keep vehicles operating, while batched analytics keep your ML ops and long-term analytics intact.

Core components and responsibilities

1. Local gateway (Raspberry Pi 5 + AI HAT+ 2)

Responsibilities: collect telemetry, run inference, apply routing/policy, persist a local journal, and drive sync. Practical tips:

  • Use 64-bit Raspberry Pi OS or a minimal Debian arm64 base. Pi 5 performance makes multi-threaded ingestion and inference feasible.
  • Leverage the AI HAT+ 2 for quantized models (ONNX/TFLite) for local decisioning—avoid huge transformer models unless heavily optimized.
  • Run critical services as containers or systemd units for easier lifecycle management.

2. Local messaging & buffer

Use an embedded MQTT broker (e.g., mosquitto) or lightweight NATS for pub/sub between telemetry producers and consumers. Persist messages into a write-ahead local store (SQLite for small volumes, RocksDB/LevelDB for higher throughput).


  # Minimal SQLite schema for telemetry journaling
  CREATE TABLE telemetry_journal (
    seq INTEGER PRIMARY KEY,          -- monotonically increasing per device
    received_ts DATETIME DEFAULT CURRENT_TIMESTAMP,
    device_id TEXT,
    payload BLOB,
    model_version TEXT,               -- set when an inference produced or graded the event
    processed BOOLEAN DEFAULT 0
  );
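
A minimal Python shim over that journal might look like the sketch below; the WAL and synchronous pragmas are assumptions to tune against your own durability and throughput needs:


  import sqlite3

  def open_journal(path="telemetry_journal.db"):
      """Open the journal with a write-ahead log so power cuts don't corrupt it."""
      conn = sqlite3.connect(path)
      conn.execute("PRAGMA journal_mode=WAL;")
      conn.execute("PRAGMA synchronous=NORMAL;")  # durability/throughput tradeoff
      return conn

  def append_event(conn, seq, device_id, payload, model_version=None):
      """Append one event; seq must be monotonically increasing per device."""
      conn.execute(
          "INSERT INTO telemetry_journal (seq, device_id, payload, model_version) "
          "VALUES (?, ?, ?, ?)",
          (seq, device_id, payload, model_version),
      )
      conn.commit()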
  

3. Edge decisioning

Decisioning examples: navigation reroute, obstacle classification, sensor fusion. Patterns:

  • Keep the model inference deterministic and versioned. Store model metadata in the gateway and use a periodic model-checker to fetch updates. See notes on model registry and versioning.
  • Prefer local thresholding + lightweight classifiers for safety-critical decisions; only send summaries upstream to save bandwidth.
  • Log inference metadata (model version, confidence, inputs hash) to the telemetry journal for later evaluation in ClickHouse.
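
As a sketch of that last point, an inference logger can persist a compact metadata record instead of raw inputs; log_inference and its fields are illustrative, reusing the append_event helper above:


  import hashlib
  import json
  import time

  def log_inference(conn, seq, device_id, model_version, inputs, confidence, label):
      """Journal what the model saw and decided, without shipping raw inputs upstream."""
      meta = {
          "model_version": model_version,
          "label": label,
          "confidence": round(confidence, 4),
          "inputs_sha256": hashlib.sha256(inputs).hexdigest(),  # inputs: raw bytes; hash, not payload
          "inference_ts": time.time(),
      }
      append_event(conn, seq, device_id, json.dumps(meta), model_version)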

4. Local routing & policy engine

Routing here is logical: decide which messages are critical (must sync immediately when possible), which are low-priority (batch), and which are ephemeral (discard after X). Use policy rules and service-level priorities:

  • Priority 1: safety events (send ASAP, keep retrying until acked)
  • Priority 2: navigation snapshots (batched with dedup)
  • Priority 3: telemetry metrics (aggregate and compress)
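
One way to express those rules in code is a small first-match-wins table; the topic prefixes and retention windows below are illustrative assumptions:


  from dataclasses import dataclass
  from enum import IntEnum

  class Priority(IntEnum):
      SAFETY = 1      # send ASAP, retry until acked
      NAVIGATION = 2  # batch with dedup
      METRICS = 3     # aggregate and compress

  @dataclass
  class Route:
      priority: Priority
      max_age_s: float | None  # discard events older than this; None = keep forever

  # First matching topic prefix wins; unmatched topics default to low priority.
  RULES = [
      ("safety/",  Route(Priority.SAFETY, None)),
      ("nav/",     Route(Priority.NAVIGATION, 3600)),
      ("metrics/", Route(Priority.METRICS, 86400)),
  ]

  def classify(topic):
      for prefix, route in RULES:
          if topic.startswith(prefix):
              return route
      return Route(Priority.METRICS, 86400)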

5. Sync agent

The sync agent coordinates upload windows, controls bandwidth, compresses batches, and guarantees idempotent inserts into ClickHouse. Key features:

  • Checkpointing via sequence numbers and persisted watermarks.
  • Adaptive backoff and bandwidth shaping (e.g., token-bucket) to avoid saturating metered links.
  • Multi-format output: JSONEachRow for small batches, compressed CSV or Parquet for bulk loads.
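
A token-bucket shaper is only a few lines of Python; the rate and burst values are placeholders you would derive from your data plan:


  import time

  class TokenBucket:
      """Shape uploads to `rate` bytes/sec with bursts up to `burst` bytes."""
      def __init__(self, rate, burst):
          self.rate, self.capacity = rate, burst
          self.tokens, self.last = burst, time.monotonic()

      def consume(self, nbytes):
          """Block until nbytes worth of tokens are available, then spend them."""
          if nbytes > self.capacity:
              raise ValueError("batch larger than burst capacity; split it first")
          while True:
              now = time.monotonic()
              self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
              self.last = now
              if self.tokens >= nbytes:
                  self.tokens -= nbytes
                  return
              time.sleep((nbytes - self.tokens) / self.rate)


Wrapping every upload in bucket.consume(len(batch_bytes)) means even an aggressive resync after a long outage respects the link budget.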

Data model & ClickHouse integration

Design ClickHouse tables for append-only telemetry with idempotency and efficient queries.


  CREATE TABLE telemetry_raw (
    device_id String,
    event_ts DateTime64(3),
    seq UInt64,
    payload String,
    model_version String,
    processed UInt8
  ) ENGINE = ReplacingMergeTree(seq)
  PARTITION BY toYYYYMM(event_ts)
  ORDER BY (device_id, seq)
  SETTINGS index_granularity=8192;
  

Why ReplacingMergeTree? It lets you insert duplicates during resync: rows sharing the sorting key (device_id, seq) collapse to a single row at merge time, so exact re-sends are harmless (for update/delete semantics, use CollapsingMergeTree with a sign column instead). Use TTL to expire old raw payloads and aggregate into downsampled tables for cost control. For a broader take on storage and cost tradeoffs see A CTO’s Guide to Storage Costs.

Best practices for ingestion

  • Use HTTP POST to ClickHouse with FORMAT JSONEachRow for simplicity. For high throughput use native or binary formats and gzip compression.
  • Batch size: 1–10k rows or up to 5–10 MB per request; compress with gzip.
  • Always include device_id + seq for idempotency. Use ReplacingMergeTree or CollapsingMergeTree for dedup semantics.

  # Example curl push (basic); batch.json.gz is the pre-compressed JSONEachRow batch
  gzip -c batch.json > batch.json.gz
  curl -sS --data-binary @batch.json.gz \
    "https://clickhouse.example.com/?query=INSERT%20INTO%20telemetry_raw%20FORMAT%20JSONEachRow" \
    -H "Content-Encoding: gzip" \
    --cert device.crt --key device.key
  

Offline-first sync patterns

Checkpointing and sequence numbers

Keep a monotonically increasing seq per device. Store a durable checkpoint (last-acked seq). When connectivity returns, upload segments from checkpoint+1 to head. This ensures reliable ordered replay.
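
A durable checkpoint can be as simple as an atomically replaced JSON file; the path here is an assumption, and write-then-rename avoids torn files on power loss:


  import json
  import os
  import tempfile

  CHECKPOINT = "/var/lib/gateway/checkpoint.json"  # illustrative path

  def load_checkpoint():
      try:
          with open(CHECKPOINT) as f:
              return json.load(f)["last_acked_seq"]
      except FileNotFoundError:
          return 0  # fresh device: replay from the beginning

  def save_checkpoint(seq):
      """Write to a temp file, fsync, then rename so the checkpoint is never torn."""
      fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CHECKPOINT))
      with os.fdopen(fd, "w") as f:
          json.dump({"last_acked_seq": seq}, f)
          f.flush()
          os.fsync(f.fileno())
      os.replace(tmp, CHECKPOINT)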

Batching, compression, and deduplication

  • Aggregate similar telemetry (e.g., GPS pings) by sampling or delta-encoding to shrink payloads.
  • Compress batches with gzip or zstd—zstd is increasingly standard in 2026 for its speed and compression ratio.
  • Include hash or UUID per event to detect duplicates when ClickHouse dedup semantics are insufficient.
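
For the GPS case, delta-encoding is straightforward and pairs well with zstd, since small repeated deltas compress extremely well; the fixed-point scale below is an assumption:


  def delta_encode(points):
      """Delta-encode (ts, lat, lon) tuples as integers (1e-7 degree fixed point)."""
      out, prev = [], None
      for ts, lat, lon in points:
          cur = (int(ts), round(lat * 1e7), round(lon * 1e7))
          # First point is stored absolute; the rest as differences from the previous fix.
          out.append(cur if prev is None else tuple(c - p for c, p in zip(cur, prev)))
          prev = cur
      return out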

Adaptive sync windows

Implement sync policies that respect cost and latency profiles. Typical strategies:

  • Immediate sync for safety-critical events using small reliable messages and persistent retries.
  • Scheduled bulk sync during known-good windows (e.g., when connected to HQ wifi or during low-cost satellite windows).
  • Network-aware policy—use signal strength, measured latency, and available bandwidth to control batch size. For practical connectivity planning, consult The Road-Trip Phone Plan.
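
A sketch of such a network-aware policy, with thresholds that are purely illustrative:


  def choose_batch_bytes(bandwidth_bps, rtt_s, metered):
      """Pick a batch size that one upload window can realistically carry."""
      if metered:
          return 256 * 1024  # hard cap on metered links
      if rtt_s > 2.0:
          return 64 * 1024   # very high latency: keep requests small and retryable
      # Otherwise target roughly 30 seconds of transfer, clamped to sane bounds.
      target = int(bandwidth_bps / 8 * 30)
      return max(64 * 1024, min(target, 10 * 1024 * 1024))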

Conflict handling and idempotency

Because gateways may re-send events after reboots or reconnections, design for idempotent ingestion:

  • Use (device_id, seq) as a natural dedup key. Implement ReplacingMergeTree keyed on seq so duplicates collapse.
  • For state updates, write events as append-only and derive state with materialized views or aggregation jobs.
  • Store event metadata (ingest_ts, source_gateway_id) for auditability.

Security and data integrity

Secure both transport and local storage:

  • mTLS for uploads to ClickHouse; device certificates provisioned at build/deploy time.
  • Encrypt local persistent store at rest (LUKS or application-layer AES) for sensitive data.
  • Signed batches: sign compressed payloads with the device private key to detect tampering.
  • Limit inbound network services on the Pi with nftables/iptables; expose only necessary ports to local networks.
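
For signed batches, Ed25519 detached signatures are a lightweight fit. This sketch assumes the pyca/cryptography package and leaves key provisioning to your device-identity pipeline:


  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import (
      Ed25519PrivateKey,
      Ed25519PublicKey,
  )

  def sign_batch(private_key: Ed25519PrivateKey, compressed: bytes) -> bytes:
      """Return a 64-byte detached signature over the compressed batch."""
      return private_key.sign(compressed)

  def verify_batch(public_key: Ed25519PublicKey, compressed: bytes, sig: bytes) -> bool:
      """Server-side check that the batch came from the claimed device, untampered."""
      try:
          public_key.verify(sig, compressed)
          return True
      except InvalidSignature:
          return False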

CI/CD and OTA for models and software (practical walkthrough)

Goal: reproducible, multi-arch builds for Pi 5 (arm64), automated testing, and staged rollouts.

Dockerfile (multi-arch) example


  # Wheels are built on the build platform; keep dependencies pure-Python,
  # or build this stage per-arch, so the wheels match the arm64 target.
  FROM --platform=$BUILDPLATFORM python:3.11-slim AS builder
  WORKDIR /src
  COPY requirements.txt ./
  RUN pip wheel -r requirements.txt -w /wheels

  # Final stage runs on the target platform selected by buildx (arm64/amd64 below).
  FROM python:3.11-slim
  COPY --from=builder /wheels /wheels
  COPY --from=builder /src/requirements.txt /requirements.txt
  RUN pip install --no-index --find-links=/wheels -r /requirements.txt
  COPY . /app
  WORKDIR /app
  CMD ["python","-u","gateway_service.py"]
  

GitHub Actions (build & push) snippet


  name: Build & Push
  on: [push]
  jobs:
    build:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - name: Set up QEMU
          uses: docker/setup-qemu-action@v2
        - name: Buildx
          uses: docker/setup-buildx-action@v2
        - name: Login to Registry
          uses: docker/login-action@v2
          with:
            registry: ghcr.io
            username: ${{ github.actor }}
            password: ${{ secrets.GITHUB_TOKEN }}
        - name: Build and push
          uses: docker/build-push-action@v5
          with:
            platforms: linux/arm64,linux/amd64
            push: true
            tags: ghcr.io/org/gateway:latest
  

Deployment and staged rollout

Use balenaCloud, Mender, or your fleet manager to orchestrate rollouts. Staging pattern:

  1. Push image to registry via CI.
  2. Canary deploy to 1–2 gateways (smoke tests, telemetry checks).
  3. If green, promote to 10% group, run regression, then full rollout.

Automate health checks: boot time, service heartbeats, inference output sanity checks. Roll back automatically on failure. For real-world micro-deployment patterns and tooling, see Micro Apps Case Studies.
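
A canary gate can stay very small. In this sketch only the journal query maps directly onto the schema above; the thresholds are assumptions and last_heartbeat_ts is a hypothetical helper:


  import shutil
  import time

  def health_ok(conn, max_backlog=50_000, min_free_gb=2.0, max_heartbeat_s=120):
      """Return True only if backlog, disk, and heartbeats all look healthy."""
      backlog = conn.execute(
          "SELECT COUNT(*) FROM telemetry_journal WHERE processed = 0"
      ).fetchone()[0]
      free_gb = shutil.disk_usage("/").free / 1e9
      heartbeat_age = time.time() - last_heartbeat_ts()  # hypothetical helper
      return backlog < max_backlog and free_gb > min_free_gb and heartbeat_age < max_heartbeat_s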

Operational considerations: monitoring, costs, and scaling

Monitor both local health and sync health:

  • Local metrics: queue depth, disk usage, CPU/temperature (AI workloads raise temp), inference latency and confidence distributions.
  • Sync metrics: success rate, bytes transferred per window, retry counts, lag (seq gap).
  • Central metrics in ClickHouse: ingestion throughput, dedup counts, late-arriving events.

Bandwidth and cost controls:

  • Smart sampling: reduce GPS frequency when stationary.
  • Edge aggregation: send summaries (min/median/max) instead of raw streams.
  • Use satellite/cellular plan awareness: avoid heavy uploads on metered links.
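
As an example of smart sampling, GPS cadence can track vehicle speed; the cutoffs below are illustrative:


  def gps_interval_s(speed_mps):
      """Sample position rarely when parked, at full rate when moving fast."""
      if speed_mps < 0.5:
          return 60.0  # effectively stationary: one fix per minute
      if speed_mps < 5.0:
          return 10.0  # crawling or maneuvering
      return 1.0       # at speed: full rate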

Advanced patterns and future-proofing

  • Hybrid compute: run heavier model retraining in the cloud, but push distilled models to AI HAT+ 2 for inference. Keep model registry in Git/Git LFS or an OCI artifact registry.
  • Federated learning light: send aggregated gradients or model statistics (differentially private) rather than raw telemetry if privacy requires it.
  • Edge-to-edge routing: allow nearby gateways to peer for local sync when central connectivity is down—use a mesh (WireGuard + mDNS) and elect a sync leader.

Sample operational checklist

  1. Provision device cert + initial model during manufacturing or first boot.
  2. Start local broker and journal; ensure journal storage has watermarking.
  3. Enable inference shim that writes model metadata alongside telemetry.
  4. Configure sync agent with bandwidth policy and ClickHouse endpoint (mTLS credentials).
  5. Deploy health checks and canary CI/CD workflows.

Actionable code snippets

Minimal sync logic (Python):


  import gzip
  import json
  import sqlite3

  import requests

  def read_batch(conn, last_seq, limit=1000):
      """Fetch the next journal rows strictly after the last-acked sequence number."""
      cur = conn.cursor()
      cur.execute(
          'SELECT seq, device_id, payload, model_version '
          'FROM telemetry_journal WHERE seq > ? ORDER BY seq LIMIT ?',
          (last_seq, limit),
      )
      return cur.fetchall()

  def make_payload(rows):
      """Turn journal rows into JSONEachRow-ready dicts."""
      for seq, device_id, payload, model_version in rows:
          if isinstance(payload, bytes):
              payload = payload.decode('utf-8')
          yield {"device_id": device_id, "seq": seq,
                 "payload": payload, "model_version": model_version}

  def push_to_clickhouse(batch, endpoint, cert, key):
      """POST one gzip-compressed JSONEachRow batch over mTLS; raises on HTTP errors."""
      data = '\n'.join(json.dumps(r) for r in batch).encode('utf-8')
      gz = gzip.compress(data)
      resp = requests.post(endpoint, data=gz,
                           headers={'Content-Encoding': 'gzip'},
                           cert=(cert, key), timeout=30)
      resp.raise_for_status()
      # On success, persist the batch's max seq (see save_checkpoint above).
      return True
  
Looking ahead

  • Edge AI hardware like the AI HAT+ 2 becomes standard in fleet gateways; expect steady reductions in inference TCO by 2027.
  • ClickHouse adoption continues to grow in telemetry analytics; offline-first ingestion patterns become first-class in analytics stacks.
  • Bandwidth-aware sync and privacy-preserving aggregation will be required by enterprises in 2026–2027 as regulatory and cost pressures increase.

Key takeaways (actionable)

  • Design for ordered, idempotent ingestion (device_id + seq) so delayed syncs never corrupt analytics.
  • Use local decisioning on AI HAT+ 2 to keep navigation safe and reduce cloud egress costs.
  • Batch, compress, and checkpoint aggressively—use zstd/gzip and adaptive windows.
  • Automate CI/CD and staged rollouts with multi-arch builds and canary groups to avoid mass failures.
  • Monitor both device and sync metrics and set hard limits for disk/queue size to prevent uncontrolled backlog growth.

Resilience is not just redundancy: it’s local intelligence, clear ordering, and predictable sync semantics.

Next steps / Call to action

Ready to build this architecture? Download our reference repo (includes Dockerfiles, a sync agent, ClickHouse schema, and GitHub Actions templates) and run the provided Pi image in test mode. If you need help designing a production rollout or tuning ClickHouse for high-cardinality fleet data, contact our engineering team for a workshop or scheduled audit.

Deploy smarter: start with the repo, run a 3-node Pi5 + AI HAT+ 2 test cluster, and simulate intermittent connectivity to validate your delayed-sync policies—then graduate to staged field rollouts.

