I previously set up a Proxmox high availability cluster on my $35 Dell Wyse 5060 thin clients. Now I’m improving this cluster to make it hyperconverged. Hyperconvergence is a huge buzzword in the industry right now; it basically means combining storage and compute in the same nodes, with each node contributing some of each, and clustering both the storage and the compute. In traditional clustering you have a storage system (SAN) and a compute system (virtualization cluster / Kubernetes / …), so merging the SAN into the compute nodes means all of the nodes are identical, and network traffic flows, in aggregate, from all nodes to all nodes without a bottleneck between the compute and SAN nodes.

This is part of the Hyper-Converged Cluster Megaproject.

Contents

Video

Here’s the video! Click on the thumbnail to view it.

Cost Accounting

Since the $250 number is important in the YouTube title, here’s the cost breakdown:

  • I spent $35 each on the thin clients. The seller had a 10% discount for buying 2 or more, so I bought one at full price followed by two more at $31.50 each.
  • I spent $25 each on the 8 GB RAM sticks to bring each node up to 12 GB of RAM.
  • I spent $16 on each 128 GB flash drive for my hyperconverged storage. I should note that these are pretty darn slow, even compared to spinning rust.

Ceph Install

Step Zero, have a functional Proxmox cluster. You don’t need any storage or VMs, but at least have the nodes already physically connected, Proxmox installed, and joined.

Step One, go to one node and click the Ceph tab. It will tell you Ceph is not installed. Install the latest release; as of this writing, the latest is Pacific (16.2). You may have to hit enter when the console asks if you’re sure you want to install, then wait for apt to do its magic.
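If you prefer the shell, Proxmox has a pveceph helper that does the same thing the Install button does. A minimal sketch (newer Proxmox versions accept extra flags to pick the release and repository, so see man pveceph on your version):

    # install the Ceph packages on this node, same as clicking Install in the GUI
    pveceph install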

Since this is our first node, we don’t have a Ceph configuration yet, so it will guide us through creating one.

Select your public and cluster (private) networks. I hope to make a more detailed video on Ceph networking at some point, but in short: the cluster network is used when Ceph communicates with itself (i.e. rebalancing, distributing copies of data between cluster members) and the public network is used when clients (including Proxmox’s RBD client) communicate with the Ceph cluster. Normally the cluster network will carry roughly twice as much traffic as the public network, since data replicated 3 times is first sent over the public network to the first OSD, which then sends it to the other two OSDs via the cluster network. However, during drive additions or removals, or whenever rebalancing is needed, the cluster network can carry significantly more. You can use a single network for both public and cluster if it is fast enough.
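The same wizard step can also be done from the shell with pveceph init. A minimal sketch, assuming a public network of 10.0.0.0/24 and a dedicated cluster network of 10.0.1.0/24 (both subnets are made up for illustration):

    # write the initial ceph.conf with separate public and cluster networks
    pveceph init --network 10.0.0.0/24 --cluster-network 10.0.1.0/24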

Also, select the node for the first monitor. I’m not sure why they bother asking: if you don’t have a Ceph cluster set up yet, it must be the node you just installed Ceph onto, since you won’t have Ceph installed anywhere else.

Once you’ve finished the wizard on the first node, go to each other node, click Ceph (it’ll say Ceph is not installed), and install it. Once apt finishes, the wizard should say the cluster is already configured and not let you configure it again, so click through to finish.

After everything is installed, you should have a configured cluster, with the monitor and manager installed on the node you chose first. Click on any node under Datacenter, go to Ceph -> Monitors, and create two more so there are 3 total in the cluster. If you have a bigger Proxmox cluster, spread them around a bit if possible, but you don’t need more than 3.
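The extra monitors can also be created from the shell; a sketch, run from each node that should host one (on older Proxmox versions the subcommand was pveceph createmon, so treat the exact spelling as an assumption):

    # run on each node that should host an additional monitor
    pveceph mon create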

Ceph Manager and Dashboard

Proxmox doesn’t install Ceph’s dashboard by default; Proxmox has its own dashboard which shows an overview of the cluster, but I like Ceph’s dashboard a bit better. Plus, it’s much easier to manage more complex pool arrangements through Ceph’s GUI than through the command line.

So, we need to install it. The dashboard runs as part of the manager, so you only need to install it on nodes which are going to run a manager. You don’t need many managers; I only set up 2 in this cluster (they are active/passive, not active/active).

The setup steps, from the shell of the Proxmox system (a consolidated sketch of these commands follows the list):

  1. Install the manager package with apt install ceph-mgr-dashboard
  2. Enable the dashboard module with ceph mgr module enable dashboard
  3. Create a self-signed certificate with ceph dashboard create-self-signed-cert
  4. Create a password for the new admin user and store it in a file (Ceph is actually picky about password rules here): echo MyPassword1 > password.txt
  5. Create a new admin user in the Ceph dashboard with ceph dashboard ac-user-create <name> -i password.txt administrator - ‘administrator’ is the role that Ceph has by default, so this user can then create more users through the dashboard
  6. Delete the password file - rm password.txt
  7. Restart the manager, or disable and re-enable the dashboard (ceph mgr module disable dashboard followed by ceph mgr module enable dashboard). I rebooted the node here; the documentation suggests this shouldn’t be required.
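For convenience, here are the same commands collected into one sketch; the user name ‘admin’ and the password are placeholders, substitute your own:

    # on a node that runs a ceph-mgr instance
    apt install ceph-mgr-dashboard
    ceph mgr module enable dashboard
    ceph dashboard create-self-signed-cert

    # create a dashboard admin user (the password comes from a file so it stays out of shell history)
    echo MyPassword1 > password.txt
    ceph dashboard ac-user-create admin -i password.txt administrator
    rm password.txt

    # reload the dashboard module so it picks up the new certificate
    ceph mgr module disable dashboard
    ceph mgr module enable dashboard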

Object Storage Daemons

Ceph stores data a bit differently from the filesystem most of my viewers/readers are probably familiar with - the venerable ZFS.

In ZFS, you group disks into vdevs, where redundancy is handled at the vdev level. You can then designate certain vdevs with device classes, which set them apart from normal data vdevs. These include L2ARC (a second tier of read cache), SLOG (dedicated fast space for the synchronous ZFS Intent Log), SPECIAL (separate devices that store metadata apart from file contents, to speed up access to directory listings and the like), and DDT (dedicated to storing deduplication tables). ZFS’s metadata stored on disk (or on the special vdev) contains the vdev and sector where the data lives, so ZFS can start from the uberblock (which contains the pointer to the root of the metadata tree and is stored at several well-known locations on disk) and follow the vdev+sector addresses all the way down to the individual blocks of files.

In Ceph, each disk is an OSD (Object Storage Daemon). The OSD should be backed by a single data disk, not hardware or software RAID and not a partition. Each data disk then has its own daemon which communicates with the Ceph monitors and other OSDs, so there’s an almost 1:1 correspondence between daemons and physical data disks. Redundancy is handled at a higher level, with Ceph computing which OSDs should store each block of data. Ceph uses the ‘CRUSH Map’ to compute the location of data down to the OSD level, so the client can communicate directly with the OSD hosting the data. Within the OSD, however, it needs to keep track of which sectors on its local disk it has put that data, and this goes in its database. The OSD can keep the database on the data disk, or we can optionally put the OSD’s database on a separate device (guidance suggests allocating 2-4% of the data disk’s size, depending on the workload) to speed up metadata access, similar to a SPECIAL device in ZFS. Additionally, like ZFS, we can optionally give it a faster device to store synchronous writes to reduce commit latency (although this does NOT improve throughput), known as the WAL (Write-Ahead Log). If you give the OSD a dedicated DB device it will automatically use a small bit of that as the WAL, but if you don’t have space for a dedicated DB device you can take a small partition of another disk to have just a dedicated WAL (or both, I guess).

Since redundancy is usually specified at the host level, Ceph already assumes that all disks within a single host could fail at the same time (unless you change the redundancy to be at the OSD level), so it’s safe to use partitions of a single faster disk as the DB devices for multiple data disks. For example, a 10TB spinning disk would want ~400G of DB, so taking a 2TB NVMe SSD and partitioning it to serve as the DB device for 5x spinning disks would be reasonable. Failure of that DB disk would mean all 5 spinning disks are effectively lost (since the data can’t be located on disk), but we already assumed host-level failure anyway. If you ARE using OSD-level redundancy, then don’t use partitions of a shared disk for your DB devices.
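For reference, this is roughly what creating an OSD with a separate DB device looks like using ceph-volume directly; the device paths are made-up examples, and on Proxmox you’d normally let the GUI do this instead:

    # hypothetical devices: /dev/sdb is the 10TB data disk,
    # /dev/nvme0n1p1 is a ~400G partition carved out of the NVMe SSD for its DB (and WAL)
    ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1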

The final note here is the device class. You can specify HDD, SSD, or NVMe by default. This isn’t immediately useful, but later we can set rules specifying that certain data pools must be located on a certain device class, so setting it correctly is always a good idea. This lets you avoid creating multiple Ceph clusters in a mixed-storage environment. For example, you could create a hyperconverged cluster with Proxmox, having a mix of NVMe and SATA SSD at each node, plus additional storage-only nodes with a ton of HDDs. Your VM disks can be placed on NVMe tiered with SSD, and CephFS can then use SSD tiered with HDD, all in the same cluster, like magic.
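As a preview of how that works on the command line, device classes can be corrected by hand and referenced from a CRUSH rule; a sketch, where osd.0, the rule name, and the pool name mypool are placeholders:

    # reassign an OSD's device class (the old class must be removed before setting a new one)
    ceph osd crush rm-device-class osd.0
    ceph osd crush set-device-class nvme osd.0

    # create a replicated CRUSH rule that only places data on NVMe OSDs, with host as the failure domain
    ceph osd crush rule create-replicated nvme-only default host nvme

    # point an existing pool at that rule
    ceph osd pool set mypool crush_rule nvme-only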

Now to actually set it up: First find the location of the disk (/dev/* path) and zap it: ceph-volume lvm zap /dev/sdX --destroy

Once the drive is zapped you can add it in Proxmox. Go to the node with the drive, go to Ceph -> OSD, click Create OSD, and it should find the unused disk.
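There’s also a pveceph shortcut that should do the equivalent of the GUI dialog; a sketch, with /dev/sdX standing in for your disk and the DB-device flag shown as I understand it (check man pveceph for the exact option names on your version):

    # create an OSD on the freshly zapped disk
    pveceph osd create /dev/sdX
    # optionally with its DB on a faster device (flag name assumed, verify against man pveceph)
    # pveceph osd create /dev/sdX --db_dev /dev/nvme0n1p1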

Absolute Basics of Pools

I’m just covering the absolute basics of pools here - what you can create in Proxmox’s storage pool dialog. There are of course many other ways to configure pools, and I may produce more content on Ceph in the future.

Create a pool in Proxmox by going to Ceph -> Pools, click Create, enter the information. Of note:

  • Name it whatever you want. Proxmox will use this name as the name of its corresponding storage.
  • PG Autoscale should usually be on. This dynamically resizes the number of placement groups (PGs) based on how much data is in the pool.
  • Size of 3 means the pool will try to keep 3 copies of all data. Min size of 2 means the pool will allow write operations to complete when only 2 of the copies have been written, and will allow continued operation on the pool as long as at least 2 of the copies are intact. This means you can never get into a scenario where losing a host causes loss of any data (as long as Ceph has completed the write), since the write is not completed until the data is on at least 2 hosts, and you can continue operating with one host down.
  • The CRUSH rule (replicated_rule) is the default; we will get into how to configure this in future episodes of the series. The CRUSH rule says things like which device class to use and what the failure domain is (by default, any device class, with host-level redundancy). Proxmox has no GUI to configure this.
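If you’re curious what the dialog is doing underneath, the same pool can be created and tuned with plain ceph commands; a sketch, where vm-disks is a made-up pool name (note that Proxmox’s dialog also registers the pool as a Proxmox storage, which these commands don’t do):

    # create a replicated pool and apply the settings from the dialog above
    ceph osd pool create vm-disks
    ceph osd pool set vm-disks size 3
    ceph osd pool set vm-disks min_size 2
    ceph osd pool set vm-disks pg_autoscale_mode on

    # tag the pool for RBD so Ceph doesn't warn about an application-less pool
    ceph osd pool application enable vm-disks rbd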