I’ve been having some issues with my home Kubernetes cluster lately. One of my servers has been consistently triggering a kernel panic on reboot, one is constantly OOM killing processes, and my rack sucks an enormous amount of power at idle.
I have also been neglecting my updates maintenance over the past year or so. I use a combination of Renovate, GitHub, and Flux to manage automatically deploying updates to the services I host, but breaking changes require manual intervention. Rather than trying to update my existing stack, I’ve decided to replace (most of) it.
Project scope
Before getting into the fun technical details, I first need to identify what I want to accomplish, and what my constraints are. In no particular order, I want to:
- Reduce my monthly power costs spent on compute. As mentioned above, my power costs are huge and they are only getting higher. There was a (roughly) 5% increase in power costs in my area this year, and due to local/state politics I am willing to bet that there will be another, larger increase in the next year or two.
I won’t say that I want to reduce all the power costs associated with my rack, as I don’t want to include replacing my 60+ TB of storage or my network in the scope of this project. However, if I can eventually spin down my existing Kubernetes nodes, then my costs should be reduced significantly.
- Improve the stability of the services I host. Primarily this means addressing the kernel panics and using nodes with more RAM. I’d also like to limit the single points of failure that some of my services have, such as Jellyfin. A few services can only be scheduled on a specific node due to hardware requirements (such as a GPU for transcoding media).
- Improve upon the architecture of my existing cluster. I’ve learned a lot while running the current iteration of my home infrastructure for the past few years, and there are a handful of things that I’d like to change but can’t without several major overhauls. However, I really like the general design (Kubernetes platform with GitOps workflows) and intend to continue using this approach.
- Continue to learn and develop my skillset. I believe that practicing skills and using knowledge are critical to retaining them. I also believe that work on my homelab has a hugely positive impact on my personal career growth, and my value as an employee to my employer. I wouldn’t have my current job without (in part) what I’ve worked on in my free time. Tangentially related, one of my work benefits provides a large stipend that can be used for this project specifically because it makes me better at what I do.
I have several constraints that affect the design decisions for this project:
- I want to keep the total cost under $5000. This includes all hardware and software (if needed), but not the value of my time. This might sound like a lot, but as outlined below, it gets eaten up pretty quickly.
- The hardware needs to be quiet. I would like to be able to keep it in my office, as the 80+ dB noise of my server rack isn’t remotely close to being reasonable for this. Unfortunately I don’t really know for certain what a reasonable/tolerable noise level is, but I think under 40 dB at idle should be fine.
- The end result needs to be physically small and look decent. I don’t want a stack of mid-tower cases, or a 42U server rack in my office.
- I need to be able to run most of my current workloads on it. The biggest things this impacts are node processor ISA, total cluster CPU, total cluster RAM, in-cluster storage, and inter-node network bandwidth. I think I can get away with ARM64 nodes while only losing my game servers. RAM is a bit tougher - I typically need about 80 GB of RAM, although the 25 GB of RAM used by my game servers can be excluded.
My preferred choice in cluster storage interface (rook-ceph) places some heavy requirements on the hardware, typically using 8 GB of RAM per drive. To be performant it also needs SSDs with power loss protection (PLP allows for sync writes to be performed asynchronously, see this link for some benchmarks), and a fat network link between nodes to sync drives distributed across failure zones.
With these requirements, I think I need to meet the following specs at bare minimum:
- Three nodes (storage redundancy with a one-node loss before in-cluster storage becomes read only).
- 24 GB of RAM per node. This will provide somewhere around 16 GB of RAM for applications.
- One SATA Ⅲ or NVMe slot with 8 GT/s cumulative bandwidth per node. This is for in-cluster storage. If NVMe, then it needs to be a 22110 slot due to the limited number of 2280 NVMe SSDs with PLP.
- A separate drive for local storage. This will contain the OS, container images, ephemeral container space, etc.
- A 5 Gbps full-duplex network connection between each node. This is also for in-cluster storage.
- Hardware transcoding on two or more nodes.
- A cumulative 5 Gbps link between the cluster and the rest of my network. This is estimated based on the peak usage of my current cluster, which largely consists of downloading data from the Internet while also streaming media from my NAS, through Jellyfin, to two clients.
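
To make the RAM portion of this list concrete, here’s a quick back-of-the-envelope sketch. The 24 GB per node and 8 GB per rook-ceph drive figures come from the requirements above; treating everything left over as application RAM (ignoring OS and kubelet overhead) is a simplification on my part:

```python
# Rough sizing sketch. The 24 GB/node and 8 GB per rook-ceph drive (OSD)
# figures come from the requirements above; treating all remaining RAM as
# application RAM, and the ~55 GB workload estimate, are simplifications.

NODES = 3                  # minimum node count for Ceph to tolerate a one-node loss
RAM_PER_NODE_GB = 24
CEPH_OSD_RAM_GB = 8        # one in-cluster storage drive (OSD) per node
WORKLOAD_RAM_GB = 80 - 25  # current usage, excluding the game servers

apps_per_node_gb = RAM_PER_NODE_GB - CEPH_OSD_RAM_GB  # ~16 GB, as estimated above
apps_total_gb = NODES * apps_per_node_gb

print(f"RAM left for applications per node: ~{apps_per_node_gb} GB")
print(f"RAM left for applications, cluster: ~{apps_total_gb} GB")
print(f"Typical workload RAM (minus game servers): ~{WORKLOAD_RAM_GB} GB")
```

That lands a little under the ~55 GB I typically use, which is part of why I treat these numbers as a floor rather than a target.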
New cluster, new hardware
I’ve been researching options for the nodes for a few months now. Most machines with an x86 processor eat a ton of power, and those that don’t are either very slow or very expensive compared to other options. There are some options on eBay for cheap “micro” PCs such as the Dell Optiplex series, but none of the options I saw could be retrofitted with a 5Gbps (or faster) NIC, 32 GB of RAM, and two drives that meet my requirements. Used servers are also not an option due to size, aesthetic, and noise requirements. A stack of laptops is also not terribly pleasing to look at, and there are few that would meet my other requirements. This primarily leaves ARM64 single board computers (SBCs).
ARM processors have historically trailed very far behind x86 processors in raw performance. However, recent ARM64 CPUs actually beat high-end x86 processors in some workloads and some sectors. While purchasing an AWS Outposts rack with Graviton processors is not remotely feasible for several reasons, these advances have been mirrored by advances in the ARM64 processors used in other applications.
The Rockchip RK3588 System on Chip (SoC) is the latest and greatest in economical SBC processors. While there are other, faster options, they are either only available in expensive, locked-down products, or are ultra-expensive and specialized1. The RK3588 SoC touts four A76 cores and four A55 cores, a GPU with transcoding support, quad-channel RAM, two 2.5Gbps MACs, and several PCIe and SATA Ⅲ buses. While the SoC doesn’t inherently address most of my requirements, it does at least support them.
I found several options for RK3588-based SBCs. Here’s a comparison I made of their features2:
Name | Price | RAM | NICs | M.2 Slots | SATA III | eMMC |
---|---|---|---|---|---|---|
ROCK Pi 5B | $189 | 16 GB LPDDR4 | 1x 2.5GBase-T | 1x M key PCIe 3.0 x4, 1x E key PCIe 2.1 x1 | 1x via M.2 E key slot | Proprietary socket, not included |
DB3588V2 | $229, minimum, very unclear | 32 GB LPDDR4 | 2x 1000Base-T | 1x M key PCIe 3.0 x2 | 2x standard port | 256 GB eMMC 5.1 |
BPI-RK3588 | $160 | 8 GB LPDDR4 | 1x 2.5GBase-T | 1x M key PCIe 3.0 x4, 1x E key PCIe 2.1 x1 | 1x via M.2 E key slot | 32 GB |
ROC-RK3588-RT | $319 | 32 GB LPDDR5, very unclear | 1x 2.5GBase-T, 2x 1000Base-T | 2x, one E key, very unclear | 1x via M.2 E key slot | 128 GB |
Orange Pi 5 Plus | $189, presale price | 32 GB LPDDR4X | 2x 2.5GBase-T | 1x M key PCIe 3.0 x4, 1x E key PCIe 2.1 x1 | None | Proprietary socket, not included |
Turing RK13 | $260 | 32 GB LPDDR4 | 1x 802.3ab MDI 4 | 1x M key PCIe 3.0 x4 | 2x standard port, only on node 3 | 32 GB eMMC 5.1 |
Blade 35 | $439 | 32 GB LPDDR4 | 2x 2.5GBase-T, 1x 16 Gbps custom TCP/IP over PCIe implementation | 1x M key PCIe 3.0 x2 | 1x standard port | 256 GB eMMC 5.1 |
Every option has its pros and cons. In the end, two options stood out to me: the Turing RK1 and the Blade 3. Both of these boards can attach to a larger cluster board (optional for the Blade 3) for additional features. This results in a much cleaner, less “jury-rigged” end result than a bunch of bare boards stacked up with wires coming out and running everywhere. From a more technical perspective, both options have a basic baseboard management controller, which supports things like switching the nodes on and off and remotely accessing the nodes as if physically connected to them. The key differentiator between the two is that the Blade 3 has a really cool feature when installed in a Cluster Box, the Blade 3’s clustering solution: the Blade 3 nodes can communicate via a network that uses PCIe plus software for layer 1 of the stack, instead of 802.3 Ethernet over copper/fiber.
Unfortunately this does have some drawbacks, which I’ll cover later.
Despite the high price (compared to the other options), the Blade 3 is the clear winner for my specific use case. The Turing RK1 comes close to matching it in features, but the network link between nodes is just too slow. Additionally, the lack of a SATA port on each node is a major drawback. If I didn’t have this requirement then I probably would have used the Turing RK1, or the Orange Pi 5 Plus, for my nodes. Instead, I’m moving forward with four Blade 3s installed in a Cluster Box.
An interesting alternative
One other option that I saw was the Zimablade. Cringy website aside, it actually meets most of my requirements. With this option I could relax my RAM requirements by purchasing additional semi-dedicated nodes for in-cluster storage, thanks to the low price. However, this does have drawbacks in that the out-of-the-box form factor is awful for a cluster setup with a full PCIe card installed. Making these look halfway decent would require designing and manufacturing an enclosure for them. I may revisit these at some point in the future if I want some x86 nodes and/or nodes with Intel Quick Sync, but for now they are not an option.
Undocumented (until now) Blade 3 information
While the Blade 3 has significantly better documentation than most of the other options I evaluated, there is still some technical product information that is (to my knowledge) not yet documented anywhere online. Prior to placing my order, I reached out to the company for clarification on a few things. Here’s what I asked, and what Mixtile told me:
- What is the highest throughput way to connect one Cluster Box network to another, or an external network? - Connect a switch with LACP support to the two 2.5Gb Ethernet ports of any Blade 3 in the Cluster Box. The highest throughput is 5Gbps. The Ethernet port of the control board (the RJ-45 jack on the outside of the Cluster Box) is only for configuring Blades and not for clustering.
- Could multiple Cluster Boxes be daisy chained via these SFF-8643 PCIe interfaces, to establish a network that spans all attached Cluster Boxes? - Multiple Cluster Boxes cannot be daisy-chained via these SFF-8643 PCIe interfaces because all the SFF-8643 ports are connecting to the upstream of the PCIe Switch chip.
- What network topology does the Cluster Box use? - The Cluster Box has a built-in PCIe Switcher and the topology of PCIe is point-to-point. There are four nodes in the Cluster Box and they are connected to the PCIe Switcher. We developed a TCP/IP over PCIe interface and the whole network topology of the cluster is a star topology.
- What layer 2 features does the switch support? Does it support VLAN tagging, switching, and trunking? What about QoS? SNMP? Jumbo frames? - Yes, the switch supports the following layer 2 features: MAC Address Collision Detection, and switch functions such as VLAN, and QoS. It supports what OpenWRT supports.
- What layer 3 features does the switch support (if any)? IPv4/IPv6 static and/or dynamic routing? BGP or OSPF support? BFD support (if BGP is supported)? Firewall support of any kind? - Yes, the switch supports the following layer 3 features: IPv4/IPv6, static and/or dynamic routing, basic NAT-based firewall. BGP or OSPF depends more on the firmware, the current OpenWrt version installed doesn’t support those yet but we plan to provide BGP and BFD support in the following months.
- Is the switch capable of providing other network services (DHCP, PXE boot, DNS forwarding, NTP, etc.)? - The switch is capable of providing these network services: DHCP, PXE boot, DNS forwarding, and NTP. It can provide the services that OpenWRT supports.
- Can the switch OS (OpenWRT as I understand it) be replaced? If so, how is this done, and what recovery options are there? - The OpenWrt OS cannot be replaced with another OS. We will provide the source code of the OpenWrt OS and the user can customize the OpenWrt system. The user could then use LuCI to update the firmware.
- Do the network interfaces share bandwidth with the M.2 slots and other PCIe interfaces? How can I connect the PCIe switch network to an external network? - Yes, network interfaces share bandwidth with M.2 slots. Each Blade 3 has four lanes of PCIe 3.0, when M.2 SSD is connected, two lanes of PCIe will be used by M.2 SSD so it leaves two lanes of PCIe for the network interfaces.
- Can the M.2 slot and SATA 3.0 port for each Blade be used at the same time? In other words, can I install a M.2 drive and attach a SATA disk at the same time on every Blade? - Yes, the M.2 slot and SATA 3.0 port could be used at the same time. SATA 3.0 port is independent and not related to the PCIe interface.
- How long are the M.2 slots for each Blade in the Cluster Box? - The M.2 slots support standard 2280 (80 mm long × 20 mm wide) NVMe SSDs.
- How are the SFF-8643 ports connected to the Blades and control board? Are they root complex ports? Could I use them to connect to something like a U.2 SSD, or a network card? I’m a little unclear on how I could use these or what I could use them for. - SFF-8643 ports connect to the upstream of the PCIe switcher and they are root complex ports. So it cannot support U.2 SSDs or other network cards. It could be connected to other computers, using those computers as the PCIe root complex of the Cluster Box.
- What practical applications do the SFF-8643 PCIe ports have? Say for example that I used one of these to connect a Cluster Box to a PCIe port in a desktop or typical server host computer. How would the host computer communicate with these devices (Blades and control board)? Would this establish a TCP/IP over PCIe network connection between the host computer and the Cluster Box? - The SFF-8643 ports are used to connect desktop computers. The user will need a PCIe to SFF-8643 Adapter like this:
- Can the Cluster Box be used for remote out of band management (turning Blades on and off, communication via a console port or similar)? - Yes, you could use SSH to remote login the OpenWrt and then use the console to control all nodes.
- Can multiple Cluster Boxes be connected together with a high speed interface (network or maybe SFF-8643 PCIe)? - Yes, multiple Cluster Boxes can be connected together. You could use an external PCIe switcher or network switcher for this.
- You mentioned that an “external PCIe switcher” could be used to connect multiple Cluster Boxes. Can you send me a link to where I could purchase such a device? I’m not very familiar with this concept, and a quick Internet search isn’t turning up very much. - There is no suitable external PCIe switcher in the market yet but it could be customized (designed, manufactured, programmed, supported) by the user.
- Is the bootloader firmware source code for the Blades available? - We are preparing the firmware source code of Blade 3 and will update our GitHub next week. Note: this was in an email on November 3rd, 2023, and their GitHub organization as of December 20th, 2023 has had no public commits since April 24, 2023.
- Do the Blades require any software (such as kernel modules, drivers, device trees, management software) that is not included in the Linux source tree? If so, where can I get this software? - Yes, Blade 3 requires drivers for this function. Instructions are available here. Note: per this post on the manufacturer’s forum, the miop-control driver deb archive is not open source. On the bright side, I pulled the kernel modules into Ghidra and they are both small and (apparently) logically simple. I think if I pulled in the header files for the kernel they were compiled with, then they could be reverse engineered and rewritten with a few hours of work.
- Is there any additional documentation available on how to setup and use the Cluster Box? - You can refer to our website for setup instructions here.
- Can the mini-PCIe socket on each Blade be used while the Blades are installed in a Cluster Box? - Yes, the mini-PCIe sockets are available when the Blades are installed in a Cluster Box.
- The blade documentation mentions a rackmount solution supporting up to 75 blades in a 2U chassis. Where can I find more information about this? - There is no more information yet and we plan to disclose some information about that at the beginning of next year.
Note that some of the questions and responses have been minorly altered (formatting, context, removing unrelated/irrelevant information) where appropriate.
Some documented, harder to find information
Somebody else evaluating the Blade 3 might find some of this useful, although it is currently documented elsewhere:
- While the board does have both eMMC and a MicroSD card slot, I think the RK3588 itself (per the SoC datasheet, not the board documentation) only supports having one connected at a time.
- The board uses a SFF-8639 connector for power (and data). The block diagram lists that only the 12V rail is used. Per the SFF-8639 spec, the max continuous current per power pin on the connector is 1.5A, and there are three 12V rail pins. Assuming that the board follows the SFF-8639 spec, the max power draw of the board should be 18W/pin * 3 pins = 54W (see the quick calculation after this list). In reality, the max power draw is likely significantly less than this; 54W is just the spec’s technical limit.
- The power adapter provided with the box does not have an easily-visible UL mark, but it does have a CE mark.
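
For reference, here’s the 54W ceiling spelled out as a minimal sketch, assuming (as above) that only the 12V rail is used, the spec’s 1.5A continuous limit per power pin applies, and all three 12V pins are wired:

```python
# Theoretical power ceiling for the Blade 3's SFF-8639 power input, assuming
# the board follows the SFF-8639 spec as described in the list above.

VOLTS = 12.0        # only the 12V rail is used, per the block diagram
AMPS_PER_PIN = 1.5  # max continuous current per power pin (spec)
POWER_PINS = 3      # 12V rail pins on the connector

watts_per_pin = VOLTS * AMPS_PER_PIN        # 18 W per pin
ceiling_watts = watts_per_pin * POWER_PINS  # 54 W total

print(f"{watts_per_pin:.0f} W/pin * {POWER_PINS} pins = {ceiling_watts:.0f} W ceiling")
```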
Blade 3 drawbacks
I’ve covered just about everything I’ve researched regarding the Mixtile Blade 3 and Cluster Box. I’ve outlined neat features and benefits of the products, so now I’d like to share some of the downsides that I’ve noticed.
- The Cluster Box external PCIe ports (SFF-8643 connectors) are borderline useless. They have to connect to a root complex, which is typically something you’d find as a part of another CPU rather than something like a SSD or expansion card. They also require what appears to be a yet to be released driver to work, even in this capacity.
- The Cluster Box external PCIe ports use a really weird connector for an external connection. SFF-8643 is typically used for internal connections, with something like a SFF-8644 for external connections. An even better option for connecting to modern (last couple of generations) computers would be to use PCIe via Thunderbolt.
- There is no external interface to the really cool TCP/IP over PCIe backplane. In my opinion this is probably the biggest drawback of the Cluster Box product. To send packets from an external device over this network, they must first be relayed through one of the Blades. Doing this without a single point of failure makes for a really weird network topology, and requires additional complexity and hardware. This also means that external hosts are probably limited to 5Gbps when communicating with any given node. That being said, I have some ideas about how traffic could be load balanced over the external 2.5GBase-T NICs and the internal TCP/IP over PCIe NIC via the other hosts.
If Mixtile (or somebody else) ever comes out with a SFP+ NIC with a root complex that can connect to one of those ports then I’ll buy it in an instant. I would have gladly paid another $100 for this problem to be solved out of the box. /rant
- The four PCIe 3.0 lanes are split between the network backplane and the M.2 slot. As I understand it, the peak bandwidth that either can achieve is the bandwidth of two lanes (see the sketch after this list for rough numbers). It would be really nice if the PCIe switch would load balance the traffic for each device so that when one is heavily utilized and the other is not, the heavily utilized device could use more than two lanes’ worth of bandwidth.
- No external SATA power port. Something I didn’t realize until I started looking at external pass-through enclosures for SATA drives is that a majority of them require SATA power rather than AC or a DC barrel jack. Not having this means more cables and more mess. This isn’t a huge deal, but it would have been really nice to have.
- The cost for the Blade 3 nodes is really high compared to other options. The markup for the 32GB model is enormous when compared to the 4 GB model, costing $210 for 12 GB of LPDDR4. I know high density chips are expensive, but $17.5/GB is extremely high. This is almost double the cost per GB of high end, ultra dense DDR5 server DIMMs. This is largely the reason for my previous comment about the Turing RK1 being my choice if I did not have the network bandwidth requirement.
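
To put rough numbers on the lane-splitting item above, here’s a quick sketch. The per-lane figures are standard PCIe 3.0 spec math rather than anything from Mixtile’s documentation, and they line up with the ~16 Gbps backplane figure in the comparison table:

```python
# Rough line-rate math for the Blade 3's four PCIe 3.0 lanes, which are split
# two-and-two between the TCP/IP-over-PCIe backplane and the M.2 slot.
# These are raw PCIe numbers; protocol and software overhead will reduce them.

GT_PER_LANE = 8.0         # PCIe 3.0 signaling rate per lane (GT/s)
ENCODING = 128 / 130      # 128b/130b encoding efficiency
LANES_PER_DEVICE = 2      # backplane NIC and M.2 SSD each get two lanes

gbps_per_lane = GT_PER_LANE * ENCODING
backplane_gbps = LANES_PER_DEVICE * gbps_per_lane  # ~15.8 Gbps, i.e. the ~16 Gbps figure
all_lanes_gbps = 4 * gbps_per_lane                 # ~31.5 Gbps if one device had every lane
external_gbps = 2 * 2.5                            # the two 2.5GBase-T ports, bonded

print(f"Backplane (2 lanes):  ~{backplane_gbps:.1f} Gbps")
print(f"All four lanes:       ~{all_lanes_gbps:.1f} Gbps")
print(f"External NICs (LACP): ~{external_gbps:.1f} Gbps")
```

In other words, even if a Blade could throw all four lanes at the backplane it would top out around 31.5 Gbps, while anything entering over the copper NICs is capped at roughly 5 Gbps.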
Note that at the time of writing I have the products in hand, but I have not even powered them on yet. I may find additional issues as I begin working with them, or I may find that some of the problems I currently think exist, don’t.
Wrapping up
While the Mixtile products don’t appear perfect, they seem pretty great and I’m excited to move forward with them. I’ve already ordered a fully loaded Cluster Box and have it in hand, and unboxed. In the next post I’ll share the photos I’ve taken and list a couple of things that I’ve noticed that aren’t really obvious until you hold the units in your hand.
1. The last time I checked, Zynq MPSoCs and RFSoCs cost 5 figures for the chip alone. ↩
2. I found a handful of other options but did not include them due to lack of pricing and/or information. ↩
3. The specs for this are listed as if it were installed in a yet to be released Turing Pi 2. This item alone is just a System on Module (SoM)! ↩
4. This is basically a fancy way of saying 1000Base-T but with a board-to-board connector. ↩
5. The specs for this are listed as if installed in a Cluster Box. There are other configurations available if using a Breakout Board or Blade 3 Case. ↩