Storage Architecture: Tiered Ceph on Bare Metal

March 14, 2026

Previous post covered the network — VLANs, isolated 10G storage, firewall policy. This post covers what runs on top of it: Ceph, managed by Rook-Ceph.

Why Ceph

There are several ways to handle storage for this cluster, and Ceph wasn’t the only option.

The simplest approach: keep the Synology NAS for file shares and run Longhorn inside the cluster for block storage. Or go with a dedicated TrueNAS box — it handles block (iSCSI), file (NFS/SMB), and object (MinIO S3) from one appliance. Both are valid; plenty of homelabs work this way.

[Figure: Storage approach alternatives — external storage vs hyperconverged Ceph]

But this is a hyperconverged build. The whole point of HCI is that compute, storage, and networking run on the same nodes. With external storage — whether Synology, TrueNAS, or anything else — almost every I/O from a pod crosses the network to a separate box and back. That’s a bottleneck, especially for latency-sensitive workloads like databases. With Ceph on the same nodes, the primary OSD for a PVC can live on the same node as the pod. I/O stays local or crosses one 10G hop at worst.

Ceph fits the HCI model — data distributed across the same nodes that run workloads, replicated by failure domain, self-healing. Rook-Ceph manages it as a Kubernetes operator, same lifecycle as everything else on the platform.

Ceph vs Longhorn

Longhorn is the other common Kubernetes-native storage option. Here’s where they differ:

| Capability | Ceph (Rook) | Longhorn |
|---|---|---|
| Block storage | RBD (kernel driver) | iSCSI (V1) / NVMe-oF (V2 preview) |
| Shared filesystem (RWX) | CephFS | NFS pod per volume (workaround) |
| Object storage (S3) | RGW | Not supported |
| NFS/SMB gateway | NFS-Ganesha + Samba | Not supported |
| Erasure coding | Full support | Not supported |
| RAM overhead (3 nodes) | ~12-18 GB | ~1-2 GB |
| CNCF status | Graduated | Sandbox |

Longhorn is simpler and lighter. But it only does block storage — no filesystem, no object, no erasure coding. For a project that needs all three storage types plus efficient capacity usage, it’s not enough.

And there’s the experience angle. I work with OpenShift daily, but at the platform level — I never touch the Ceph layer underneath. Building Rook-Ceph from scratch means understanding CRUSH maps, OSD lifecycle, PG management, erasure coding profiles — the internals that managed Ceph abstracts away. That knowledge directly benefits my professional work.

Three tiers per node

Each node has three storage devices:

| Tier | Media | Capacity | Purpose | Part of Ceph? |
|---|---|---|---|---|
| Boot | 2.5" SSD | 200-500 GB | SCOS, etcd, local ephemeral | No |
| Fast OSD | M.2 NVMe | ~500 GB | Block storage (databases, VMs) | Yes — fast pool |
| Slow OSD | 3.5" HDD | 16+ TB | Object, filesystem, backups | Yes — slow pool |
Warning (Boot drive endurance)

The boot drive has a specific requirement: high write endurance. etcd fsyncs on every Kubernetes API change, so consumer SSDs with low TBW ratings will wear out quickly. Enterprise-grade endurance is required; the specific model is a BOM decision.

[Figure: Storage tier layout within a single node]

Two pools, two strategies

A database needs sub-millisecond latency. A media archive needs capacity. Mixing NVMe and HDD in one pool means Ceph might place a hot replica on spinning rust — 0.1ms becomes 10ms. Two pools, pinned to device classes via CRUSH rules.

Fast pool (NVMe) — replication-3

Three copies, one per failure domain. Simple, works cleanly with RBD. EC on NVMe isn’t worth it — small pool, minimal space savings, adds CPU overhead.

| Phase | Raw | Usable (rep-3) |
|---|---|---|
| Phase 1 (3 nodes) | 1.5 TB | ~500 GB |
| Phase 2 (5 nodes) | 2.5 TB | ~833 GB |

Not a lot, but block storage for databases and VMs doesn’t need terabytes.
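As an illustration, a replication-3 pool pinned to the NVMe device class could be declared through Rook's CephBlockPool CRD. This is a sketch, not the final manifest (actual CRDs are deferred to the implementation post), and the pool name is hypothetical:

```yaml
# Illustrative sketch; real CRDs are deferred to the implementation post.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: fast-block            # hypothetical pool name
  namespace: rook-ceph
spec:
  failureDomain: host         # one replica per node
  deviceClass: nvme           # CRUSH rule pins placement to NVMe OSDs
  replicated:
    size: 3                   # three copies, as above
    requireSafeReplicaSize: true
```

Rook generates the CRUSH rule from `deviceClass` and `failureDomain`, so the "pinned to device classes" part of the design is a one-line declaration rather than hand-written CRUSH map edits.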

Slow pool (HDD) — erasure coding

This is where the interesting decision is. The slow pool has the most raw capacity and benefits most from efficient replication.

Phase 1 (3 nodes) — two options:

| Strategy | Usable | Overhead | Failures tolerated | Recovery headroom |
|---|---|---|---|---|
| EC 2/1 | ~32 TB (67%) | 1.5x | 1 | None — no 4th node to rebuild on |
| Rep-3 | ~16 TB (33%) | 3.0x | 2 | Full — any 2 of 3 nodes can rebuild |

EC 2/1 doubles usable space but has no recovery headroom on exactly 3 nodes. If one node goes down, Ceph serves data fine but can’t rebuild the missing chunk. Second failure during that window = data loss. For media and backups that exist elsewhere, probably acceptable. For anything critical, it isn’t.

Note (Leaning EC 2/1)

The slow pool stores media, backups, and archives — data that can be re-obtained. But I haven’t committed yet and may change my mind during validation.

Phase 2 (5 nodes): EC 3/2 — 3 data + 2 parity. 60% usable, survives 2 failures.

| Phase | Strategy | Raw | Usable |
|---|---|---|---|
| Phase 1 (3 nodes) | EC 2/1 | 48+ TB | ~32 TB |
| Phase 1 (3 nodes) | Rep-3 | 48+ TB | ~16 TB |
| Phase 2 (5 nodes) | EC 3/2 | 80+ TB | ~48 TB |
Warning (EC migration)

EC profiles can't be changed after pool creation. EC 2/1 → EC 3/2 requires creating a new pool and migrating data. CephFS supports multi-pool layouts for this; RBD has live migration. Not trivial, but documented.
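As a hedged sketch of what the slow tier could look like in Rook, assuming the EC 2/1 choice sticks (the filesystem name and pool layout here are illustrative; the real CRDs are deferred to the implementation post):

```yaml
# Illustrative sketch: CephFS with an EC 2/1 data pool on HDD.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: slow-fs               # hypothetical name
  namespace: rook-ceph
spec:
  metadataPool:               # CephFS metadata is small and latency-sensitive,
    deviceClass: nvme         # so it goes on NVMe even for the slow tier
    replicated:
      size: 3
  dataPools:
    - name: ec-data
      deviceClass: hdd
      erasureCoded:
        dataChunks: 2         # EC 2/1: two data chunks per object
        codingChunks: 1       # plus one parity chunk
  metadataServer:
    activeCount: 1
    activeStandby: true       # two MDS daemons: active + standby
```

The multi-pool point from the warning above shows up here: `dataPools` is a list, so a future EC 3/2 pool can be added alongside the EC 2/1 pool and new data directed to it.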

[Figure: Ceph pool architecture — fast and slow pools with daemon placement]

Ceph daemons

| Daemon | Count | Purpose | HA |
|---|---|---|---|
| MON | 3 | Cluster map, quorum | Survives 1 node loss |
| MGR | 2 | Metrics, dashboard | Active + standby |
| MDS | 2 | CephFS namespace | Active + standby |
| RGW | 1 | S3 endpoint | Scales later if needed |
| OSD | 2 per node | One NVMe + one HDD per node | Phase 1: 6, Phase 2: 10 |
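The daemon counts map directly onto Rook's CephCluster CRD. An abridged sketch (cephVersion, dashboard, and resource settings omitted; the device filter is hypothetical):

```yaml
# Abridged sketch; image version, dashboard, and resources omitted.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                  # quorum survives one node loss
    allowMultiplePerNode: false
  mgr:
    count: 2                  # active + standby
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^(nvme0n1|sda)$"  # hypothetical: one NVMe + one HDD per node
```

The OSD count falls out of the storage section: with a filter matching two devices per node, three nodes yield six OSDs, five nodes yield ten.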

StorageClasses

What workloads see:

| StorageClass | Ceph pool | Access mode | Use case |
|---|---|---|---|
| ceph-block-fast | NVMe (rep-3) | RWO | Databases, VM disks |
| ceph-filesystem | HDD (EC) | RWX | Shared filesystems |
| ceph-object | HDD (EC) | S3 API | Backups, media, archives |
Workloads reference the StorageClass in their PVC. CRUSH rules, device classes, EC profiles — that’s infrastructure, not their problem.
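For illustration, a fast-tier StorageClass and a claim against it could look like this (the provisioner name is prefixed by the Rook operator namespace; the pool name is hypothetical, and CSI secret parameters are omitted for brevity):

```yaml
# Illustrative sketch; CSI secret parameters omitted for brevity.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-fast
provisioner: rook-ceph.rbd.csi.ceph.com  # prefix = operator namespace
parameters:
  clusterID: rook-ceph
  pool: fast-block                       # hypothetical fast-pool name
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
---
# The workload side: a claim names only the StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data                    # example claim
spec:
  accessModes: ["ReadWriteOnce"]         # RWO, per the table above
  storageClassName: ceph-block-fast
  resources:
    requests:
      storage: 20Gi
```

Everything below `storageClassName` in the PVC is workload intent; everything about CRUSH, device classes, and replication stays on the infrastructure side of the line.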

What’s deferred

Note (Deferred decisions)
  • Drive models and capacities — BOM post
  • Boot drive endurance spec — BOM, real cost implications
  • Final EC 2/1 vs rep-3 decision — can wait until deployment
  • Rook-Ceph CRDs — implementation post
  • EC migration procedure — Phase 2 implementation
  • Ceph tuning — OSD memory, PG counts, scrub schedules. Day-2

Summary

Summary (Storage design at a glance)
  • Ceph via Rook-Ceph — fits the HCI model, covers block + filesystem + object from one cluster
  • Alternatives considered (Synology + Longhorn, TrueNAS), but they don't fit the hyperconverged approach
  • Hands-on Ceph experience directly benefits professional work
  • Three tiers per node: boot SSD (local), fast NVMe (Ceph), slow HDD (Ceph)
  • Fast pool: NVMe, rep-3, ~500 GB usable in Phase 1
  • Slow pool: HDD, leaning EC 2/1 in Phase 1 (~32 TB), EC 3/2 in Phase 2 (~48 TB)
  • NFS/SMB gateway available if I want to replace the Synology later
  • Rook-Ceph as operator — CNCF Graduated, Kubernetes-native lifecycle

This is the last design subpost. Compute, network, and storage together define what the cluster needs. Next comes the BOM — where requirements become hardware with real prices.