Running Production on a Raspberry Pi 5

There is a Raspberry Pi 5 on a shelf in my apartment running 36 Docker containers. It serves a React dashboard with SSO. It runs PostgreSQL with 227 migrations and 5 database roles. It hosts 76 automation workflows. It manages a semantic memory server, an MCP proxy layer, Home Assistant with presence detection, and a reverse proxy that handles TLS termination for three domains. It has been doing this for over a month with no unplanned downtime.

This is not a flex. It is a design choice with specific reasons, specific tradeoffs, and specific lessons that apply to infrastructure at any scale.

Why a Pi

The obvious question. The honest answer has three parts.

Cost. The Pi 5 was about $80. The entire compute layer (Pi plus a separate GPU node for inference) cost less than one month of the equivalent cloud bill. For a personal infrastructure project funded by disability income, this matters.

Constraint-driven design. 8GB of RAM and a quad-core ARM processor force you to care about resource efficiency in ways that a 64GB cloud VM does not. Every container has CPU limits. Every query matters. Every background process that wastes cycles is a process you notice, because the headroom does not exist to ignore it. This produces better architecture. Not because constraints are romantic, but because they make waste visible.

Portability proof. If the architecture runs on a Pi, it runs anywhere. Docker Compose on ARM is the same as Docker Compose on x86. The migrations work on any PostgreSQL instance. The reverse proxy config ports to any Caddy deployment. Building on the most constrained hardware proves that the patterns are not dependent on generous resources. When the system moves to cloud (and it will, for professional workloads), the migration is configuration, not redesign.

The container topology

36 containers across 5 isolated Docker networks. This is not 36 random services thrown into a single bridge network. The network segmentation is intentional.

Frontend network. Caddy (reverse proxy), the dashboard API, the dashboard web server. These are the containers that face the internet. Caddy terminates TLS and routes to Authelia for SSO before anything reaches a backend service.

Backend network. PostgreSQL, Redis, n8n, the memory server, MCP proxies. These containers talk to each other but are not exposed to the internet; outside requests reach them only through Caddy.

Auth network. Authelia and its dependencies. Isolated because the authentication layer should not be on the same network as the services it protects.

Automation network. n8n workers, webhook receivers, cron-triggered pipelines. These handle the async work: email classification, calendar sync, health data processing, system state collection.

Home automation network. Home Assistant, Wyoming voice pipeline services, satellite device communication. This network bridges to the physical world: Sonos speakers, Ecobee thermostat, Dyson air purifier, smart lighting.

The networks overlap where they need to. PostgreSQL is reachable from the backend and automation networks. Caddy can route to any backend. But a compromised automation container cannot reach the auth layer, and a misconfigured Home Assistant integration cannot touch the database directly.

This is the same network segmentation pattern that production Kubernetes deployments use with network policies. The tooling is simpler (Docker Compose network declarations vs. Calico/Cilium), but the principle is identical: least-privilege network access.
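
The reachability rule is easy to model: two containers can talk only if they share at least one Docker network. The sketch below is illustrative, not the actual Compose configuration. The container names and the exact network attachments are assumptions chosen to match the description above.

```python
# Hypothetical membership map: which Docker networks each container joins.
# Network names follow the essay; the attachments themselves are assumptions.
NETWORKS = {
    "caddy": {"frontend", "backend", "auth"},
    "dashboard-api": {"frontend", "backend"},
    "authelia": {"auth"},
    "postgres": {"backend", "automation"},
    "n8n-worker": {"automation"},
}


def can_reach(src: str, dst: str) -> bool:
    """Two containers can talk only if they share at least one network."""
    return bool(NETWORKS[src] & NETWORKS[dst])


# PostgreSQL sits on both the backend and automation networks.
worker_to_db = can_reach("n8n-worker", "postgres")
# An automation container shares no network with the auth layer.
worker_to_auth = can_reach("n8n-worker", "authelia")
# Caddy is attached wherever it needs to route.
caddy_to_auth = can_reach("caddy", "authelia")
```

A check like this can run in CI against the parsed Compose file, so that an accidental extra network attachment shows up as a failed assertion rather than a silent widening of access.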

The database

PostgreSQL 16 with 227 sequential migrations tracked in a schema_migrations table. A custom migration runner checks which migrations have been applied and runs only the new ones. Every migration is idempotent.
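
A minimal sketch of that runner pattern, using an in-memory SQLite database as a stand-in for PostgreSQL so the example is self-contained. The function name, the two sample migrations, and the column layout of schema_migrations are assumptions for illustration, not the real runner.

```python
import sqlite3

# Hypothetical migrations: (version, name, SQL). IF NOT EXISTS keeps each
# statement idempotent, so re-running one by hand does no harm.
MIGRATIONS = [
    (1, "create_notes",
     "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"),
    (2, "index_notes_body",
     "CREATE INDEX IF NOT EXISTS idx_notes_body ON notes (body)"),
]


def apply_pending(conn, migrations):
    """Run migrations that are not yet recorded, in version order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "version INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    ran = []
    for version, name, sql in sorted(migrations):
        if version in applied:
            continue
        conn.execute(sql)
        # Record the version only after the migration statement succeeded.
        conn.execute(
            "INSERT INTO schema_migrations (version, name) VALUES (?, ?)",
            (version, name),
        )
        conn.commit()
        ran.append(version)
    return ran


conn = sqlite3.connect(":memory:")
first_run = apply_pending(conn, MIGRATIONS)   # applies both migrations
second_run = apply_pending(conn, MIGRATIONS)  # nothing left to apply
```

The second call returning an empty list is the property that matters: the runner can be invoked on every deploy without tracking state anywhere except the database itself.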

Five database roles:

Superuser (administrative DDL, never used by applications)
App role (read-write for application queries, can access all schemas including sensitive ones)
Readonly role (reporting, dashboard API queries, MCP tool access)
Remote read-write role (scoped access for external MCP connections, no DELETE, no DDL, no access to sensitive schemas)
Home Assistant role (recorder access only, isolated to HA tables)

This is not theoretical. The privacy tier model has real enforcement. There is an isolated schema for sensitive data that only the app role can see. The readonly and remote roles do not know it exists. When external AI tools query the database through MCP, they literally cannot access the protected schema. Not “should not.” Cannot.
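
The role separation can be written down as a privilege map and turned into GRANT statements. This is a sketch under assumptions: the role names, the schema name "vault" for the isolated schema, and the exact privilege lists are hypothetical, and the superuser and Home Assistant roles are left out for brevity. The generated SQL uses standard PostgreSQL GRANT syntax.

```python
SENSITIVE_SCHEMA = "vault"   # hypothetical name for the isolated schema
GENERAL_SCHEMA = "public"

# Hypothetical privilege map. A schema missing from a role's entry means the
# role gets no USAGE on it, so it cannot touch anything inside.
ROLE_PRIVILEGES = {
    "app_rw": {
        GENERAL_SCHEMA: ["SELECT", "INSERT", "UPDATE", "DELETE"],
        SENSITIVE_SCHEMA: ["SELECT", "INSERT", "UPDATE", "DELETE"],
    },
    "readonly": {GENERAL_SCHEMA: ["SELECT"]},
    "remote_rw": {GENERAL_SCHEMA: ["SELECT", "INSERT", "UPDATE"]},  # no DELETE
}


def grant_statements(role: str) -> list[str]:
    """Emit the GRANT statements for one role from the privilege map."""
    statements = []
    for schema, privileges in ROLE_PRIVILEGES[role].items():
        statements.append(f"GRANT USAGE ON SCHEMA {schema} TO {role};")
        statements.append(
            f"GRANT {', '.join(privileges)} ON ALL TABLES IN SCHEMA {schema} TO {role};"
        )
    return statements


def can(role: str, schema: str, privilege: str) -> bool:
    """Check the map the same way the database would enforce it."""
    return privilege in ROLE_PRIVILEGES[role].get(schema, [])


readonly_sql = grant_statements("readonly")
```

Keeping the map in one place means the "cannot" is reviewable: a diff that adds the sensitive schema to the remote role is a one-line change that is hard to miss.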

The three-tier data classification (Sacred, Sensitive, Convenience) determines where data lives, which roles can access it, and which processing paths are allowed. Sacred data never leaves the local network. Sensitive data requires encrypted transport and proper authentication. Convenience data can use cloud services.

This maps directly to enterprise data classification frameworks. The difference is that it was implemented from the beginning, not bolted on after an audit finding.
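
The three tiers reduce to a small policy function. The sketch below encodes only the rules stated above; the function name and its parameters are assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    SACRED = "sacred"
    SENSITIVE = "sensitive"
    CONVENIENCE = "convenience"


def path_allowed(tier: Tier, *, leaves_local_network: bool,
                 encrypted: bool, authenticated: bool) -> bool:
    """Decide whether a processing path is acceptable for a data tier."""
    if tier is Tier.SACRED:
        # Sacred data never leaves the local network, whatever the transport.
        return not leaves_local_network
    if tier is Tier.SENSITIVE:
        # Sensitive data needs encrypted transport and proper authentication.
        return encrypted and authenticated
    # Convenience data may use cloud services.
    return True


# A cloud API call over TLS with authentication:
cloud_call = dict(leaves_local_network=True, encrypted=True, authenticated=True)
sacred_ok = path_allowed(Tier.SACRED, **cloud_call)
sensitive_ok = path_allowed(Tier.SENSITIVE, **cloud_call)
convenience_ok = path_allowed(Tier.CONVENIENCE, **cloud_call)
```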

CPU limits and resource management

Every container has explicit CPU limits in the Docker Compose file. This is the single most important operational decision on constrained hardware.

Without CPU limits, a single misbehaving container can starve everything else. I learned this the hard way: an n8n workflow with a runaway loop consumed 100% of available CPU for 3 minutes before I caught it. During those 3 minutes, the dashboard was unresponsive, PostgreSQL queries were timing out, and Home Assistant stopped processing automations.

After that incident, every container got limits. PostgreSQL gets the most (it earns it). n8n workflows get moderate allocation. Low-priority services (document caching, health data sync) get minimal shares. The system can now handle a runaway process without cascading failure, because the misbehaving container hits its ceiling and slows down instead of dragging everything else with it.

This is basic cgroup enforcement, nothing novel. But the discipline of actually setting limits on every container, rather than running wide open and hoping for the best, is something I have seen absent in production environments with significantly more resources.
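
One way to keep that discipline honest is a lint step that fails when any service lacks a limit. The sketch below works on an already-parsed Compose file (a plain dict) and accepts either of the two places the Compose specification allows a CPU limit: the service-level cpus key or deploy.resources.limits.cpus. The service names and limit values are made up for the example.

```python
def services_without_cpu_limit(compose: dict) -> list[str]:
    """Return the names of services that declare no CPU limit."""
    missing = []
    for name, service in compose.get("services", {}).items():
        deploy_limit = (
            service.get("deploy", {})
            .get("resources", {})
            .get("limits", {})
            .get("cpus")
        )
        if service.get("cpus") is None and deploy_limit is None:
            missing.append(name)
    return sorted(missing)


# Hypothetical parsed Compose file; names and values are illustrative only.
compose = {
    "services": {
        "postgres": {"deploy": {"resources": {"limits": {"cpus": "2.0"}}}},
        "n8n": {"cpus": "1.0"},
        "doc-cache": {"image": "example/doc-cache:1.4.2"},
    }
}

unlimited = services_without_cpu_limit(compose)
```

Run against the real file (parsed with any YAML loader), a non-empty result is a reason to block the deploy.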

The reverse proxy layer

Caddy handles TLS termination and routing for three domains:

ataraxis.cloud: the dashboard and API (auth-gated via Authelia SSO)
mcp-*.ataraxis.cloud: MCP proxy endpoints (dual auth: OAuth 2.1 for claude.ai, API key for desktop tools)
ha.ataraxis.cloud: Home Assistant access through Cloudflare Tunnel

Caddy was chosen over Nginx for a specific reason: automatic TLS certificate management. On a Pi with limited operational bandwidth (meaning: I am the entire ops team), not having to manage certificate renewal is worth the slight performance difference. Caddy handles ACME challenges, stores certs, and rotates them. I have never manually touched a certificate.

Authelia sits behind Caddy for SSO. One login covers the dashboard, Home Assistant, and any future service. OIDC clients allow native apps (an iOS app I am building) to authenticate against the same identity provider.

The MCP proxy layer is more involved. Model Context Protocol is how AI tools connect to external data sources. I built a custom proxy that accepts authenticated connections and routes them to internal services: database queries, n8n workflow management, memory server operations, Home Assistant control. The auth layer supports both OAuth 2.1 (for cloud-hosted AI like claude.ai) and API keys (for local desktop tools). Cloudflare Workers handle the OAuth flow at the edge.

Backup strategy

Hourly PostgreSQL dumps to local storage. Encrypted sync to Cloudflare R2 for offsite backup with 7-day retention. Zero egress costs (R2 does not charge for egress).
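
The 7-day retention rule comes down to a cutoff comparison. This is a sketch of the selection logic only, with hypothetical object keys and a fixed clock so the result is reproducible. It does not talk to R2, and the actual pruning mechanism (a bucket lifecycle rule, a script, or something else) is not specified here.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)


def expired(backups: dict[str, datetime], now: datetime) -> list[str]:
    """Return the keys of backups older than the retention window."""
    cutoff = now - RETENTION
    return sorted(key for key, taken_at in backups.items() if taken_at < cutoff)


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
# Hypothetical object keys mapped to the time each dump was taken.
backups = {
    "pg-2025-01-07T11.dump.enc": now - timedelta(days=8, hours=1),
    "pg-2025-01-08T13.dump.enc": now - timedelta(days=6, hours=23),
    "pg-2025-01-15T11.dump.enc": now - timedelta(hours=1),
}

to_delete = expired(backups, now)
```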

The backup is tested. I have restored from backup to verify the process works. An untested backup is a hypothesis, not a safety net.

WAL (Write-Ahead Log) files are cleaned up on a schedule to prevent disk fill. The Pi has a 256GB SD card (yes, SD card, and yes, I know), and PostgreSQL WAL accumulation was the first disk pressure issue I hit. Automated cleanup solved it.

The SD card is the most fragile component in the system. The mitigation is that the system is designed to be rebuildable. Docker images are pinned to specific versions (never :latest). The Compose files, migrations, and configuration are all in git. If the SD card dies tomorrow, a fresh Pi with a git clone and docker compose up gets most of the system back. The database restore from R2 fills in the data. Recovery time objective: under an hour, tested.

Monitoring (and the gaps)

The System State Registry tracks 234 properties across all services. Categories include database health, container status, network connectivity, disk usage, automation pipeline state, and external service availability. A drift detection system compares current state against expected baselines and flags anomalies.

This is custom-built. It runs as an n8n workflow that collects metrics on a schedule and writes them to PostgreSQL. A dashboard page surfaces the results.
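
The core of drift detection is a comparison between an expected baseline and the collected state. The sketch below assumes exact-match baselines and uses invented property names. The real registry may use ranges or thresholds, which this example does not model.

```python
def detect_drift(expected: dict, current: dict) -> dict:
    """Map each drifted property to (expected value, observed value).

    A property missing from the collected state is reported with None
    as its observed value.
    """
    drift = {}
    for key, want in expected.items():
        got = current.get(key)
        if got != want:
            drift[key] = (want, got)
    return drift


# Hypothetical baseline and one collection run.
expected = {
    "postgres.accepting_connections": True,
    "containers.running": 36,
    "backup.last_sync_ok": True,
}
current = {
    "postgres.accepting_connections": True,
    "containers.running": 35,
}

drift = detect_drift(expected, current)
drift_count = len(drift)
```

The drift count is the number the dashboard needs; the dict gives the operator the expected and observed values side by side for each flagged property.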

What it is not: Prometheus. Grafana. PagerDuty. Structured alerting with escalation policies and SLA tracking.

For personal infrastructure, the System State Registry is sufficient. I check the dashboard, I see the drift count, I investigate if something is flagged. For anything beyond a single operator, this would need real observability tooling. I know this. The architecture has room for it (Prometheus can scrape Docker metrics, Grafana can query PostgreSQL), but the current setup reflects the actual operational model: one person, checking a dashboard, with automated drift detection as the early warning system.

What breaks

Things break. Pretending otherwise would undermine the credibility of everything else in this essay.

Memory pressure. 36 containers on 8GB of RAM means the system runs at 70-85% memory utilization in steady state. A spike (large data import, multiple concurrent n8n workflows, a complex PostgreSQL query) can push into swap. Swap on an SD card is slow enough that you notice. The fix is careful scheduling: heavy data imports run at low-traffic times, and the most memory-hungry operations are serialized rather than parallelized.
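
Serializing the heavy work is a one-semaphore pattern. The sketch below is a generic Python illustration, not how n8n schedules workflows: a semaphore with a single slot guarantees that only one memory-hungry job runs at a time, even when several are submitted together. The peak counter exists only to demonstrate that guarantee.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_heavy_slot = threading.Semaphore(1)  # at most one heavy job at a time
_counter_lock = threading.Lock()
running = 0
peak = 0  # highest number of heavy jobs observed running at once


def run_heavy(job):
    """Run a memory-hungry callable, waiting for the single slot first."""
    global running, peak
    with _heavy_slot:
        with _counter_lock:
            running += 1
            peak = max(peak, running)
        try:
            return job()
        finally:
            with _counter_lock:
                running -= 1


# Four jobs submitted concurrently still execute one after another.
jobs = [lambda i=i: i * i for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_heavy, jobs))
```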

SD card I/O. This is the bottleneck. PostgreSQL write-heavy workloads are slower than they would be on SSD. The WAL writes, the vacuum operations, the index rebuilds: all constrained by SD card throughput. For read-heavy dashboard queries, this is fine (the data fits in PostgreSQL’s buffer cache). For bulk imports, it matters. A future upgrade to USB-attached SSD storage would meaningfully improve write performance.

Thermal throttling. The Pi 5 runs warm under sustained load. I have a heat sink and fan, but extended periods of high CPU utilization (a full test suite run, a large migration batch) can trigger thermal throttling. The practical impact is that heavy operations take 10-15% longer than they would at stable temperatures. Not catastrophic. Noticeable.

ARM compatibility. Most Docker images have ARM builds. Some do not. I have hit this twice in 35 days: a tool that only published amd64 images, and a Python package with a binary dependency that did not have ARM wheels. The workarounds are either building from source, finding an alternative, or running the specific workload on the GPU node (which is x86). This is becoming less common as ARM adoption grows, but it is not zero.

What this proves about cloud readiness

The concern that a potential employer or client might have: “This runs on a Pi, how does that translate to real infrastructure?”

The answer is that everything about this architecture is portable.

Docker Compose maps to ECS, Cloud Run, or Kubernetes manifests with configuration changes, not architectural changes. PostgreSQL with proper migrations and role separation works identically on RDS, Cloud SQL, or any managed database. Caddy’s reverse proxy configuration ports to any environment (or gets replaced by an ALB). Cloudflare Workers already run at the edge. The n8n workflows are JSON definitions that import anywhere n8n runs.

The Pi is the constraint that proves the design. If the system handles 36 containers on 8GB of RAM with proper network segmentation, CPU limits, backup strategy, and monitoring, then the same system on a cloud VM with 32GB of RAM and NVMe storage is not a question of whether it works. It is a question of how much faster.

The real skill being demonstrated is not “running things on a Pi.” It is the operational discipline: network isolation, least-privilege access, migration tracking, automated backup with tested restore, resource limits, monitoring with drift detection. These practices do not change based on the hardware they run on. They are the same practices that separate production infrastructure from “it works on my machine.”

The lesson

The Pi taught me something I would not have learned on a generous cloud VM: you cannot hide from your own bad decisions when the resources are tight. A lazy query, an unbounded log, a container without CPU limits, a background process that polls too aggressively: on a 64-core cloud instance, these are invisible. On a Pi 5, they are the difference between a responsive system and a swap-thrashing mess.

Every optimization I made because the Pi demanded it is an optimization that would improve any infrastructure. The constraint was the teacher. The patterns are the product.

If you are building infrastructure, or if you need someone who can build it under real constraints rather than theoretical ones, I would like to hear from you.