Tales of the Homelab I: Moving is fun
@kjuulh · 2026-01-23
I love my homelab. It is an amalgamation of random machines both efficient and not, hosted and not, pretty janky overall. A homelab reflects a lot about what kind of operator you are. It’s a hobby, and we all come from different backgrounds with different interests.
Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, kid myself into believing it’s cheaper (it isn’t, electricity and hardware ain’t cheap, y’all), and I like to just build stuff, if that wasn’t apparent from the previous post.
A homelab is a term that isn’t clearly defined. To me, it’s basically the meme:
Web: here is the cloud
Hobbyist: cloud at home
It can be anything from a Raspberry Pi, to an old Lenovo ThinkPad, to a full-scale rack with enterprise gear, and often several of those states exist at the same time.
My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear, etc. I basically have two sides to my homelab. One is my media / home-related stuff; the other is my software brain, with PCs running Kubernetes, Docker, this blog, and so on.
It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (a virtual machine hypervisor; basically a small datacenter at home). It runs:
- Home Assistant, where it all started; I needed an upgrade from running it on a Raspberry Pi
- MinIO (S3 server)
- Vault (secrets provider)
- Drone (CI runner)
- Harbor...
- Renova...
- Zitadel...
- Todo...
- Blo...
- Gi...
- P...
In total: 19 VMs.
You might be saying (and I don’t want to hear it) that this is simply too many. A big, glaring single point of failure. Foreshadowing, right there.
My other nodes run highly available Kubernetes with replicated storage and so on. They do, however, depend on the central node for database and secrets.
Moving
So, I was moving, and a little bit stressed because I was starting a new job at the same time (the same day, idiot move). I basically packed everything into boxes / the back of my car and moved it.
It took about a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center filled with legally procured media, I assure you.
I didn’t think too much of it. Plugged it in on top of a kitchen counter, heard it spin up... and nothing came online. I’ve got monitoring for all my services, and none of it resolved. Curious.
I grabbed a spare screen and plugged it in.
systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
Hmm. Very much hmm. Smells like hardware failure, but no panic yet.
I had an SSD in the box, the one used for all the VM volumes. I’d noticed it had been a little loose before, but it hadn’t been a problem. The enclosure is meant for a full-size HDD, not a smaller SSD.
I tried reseating the SSD. No luck.
Slightly panicky now, I found another PC and plugged the SSD into that to check whether it was just the internal connector.
Nope. Nope. Dead SSD. Absolutely dead.
The box wouldn’t boot without the ZFS pool, so I needed a way to stop that from happening. Using a live-boot Linux USB, I could disable the ZFS import and reboot.
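Roughly what that looks like, as a minimal sketch, assuming a stock Proxmox install with root on LVM (device paths and unit names may differ on other setups):

```bash
# Boot the live USB, then mount the installed Proxmox root and chroot into it
vgchange -ay                      # activate LVM volume groups (stock Proxmox keeps root on LVM)
mount /dev/pve/root /mnt          # assumption: the root LV; adjust if your layout differs
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt

# Stop systemd from trying to import the dead pool on the next boot
systemctl mask zfs-import-cache.service zfs-import-scan.service

exit
reboot
```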
The Proxmox UI, however, was a bloodbath.
0/19 VMs running. F@ck.
As it turns out, there’s sometimes a reason we do the contingencies we do professionally: high-availability setups, 3-2-1 backup strategies, etc. Even though my services had enjoyed ~99% uptime until then, the single point of failure struck, leaving a lot of damage.
The way I had designed my VM installations was by using a separate boot drive and volume drive. This is a feature of KVM / Proxmox and allows sharing a base OS boot disk while separating actual data. It’s quite convenient and keeps VMs slim.
My Debian base image was about 20 GB. Without the shared boot disk, that would’ve been 20 GB × 19 VMs, roughly 380 GB. Not terrible, and honestly, I would’ve paid that cost if I’d been paying attention.
Instead, I was left with VMs that wouldn’t boot because their boot disk was gone. Like a head without a body. A dog without a bone.
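To make that concrete, here’s a hypothetical sketch of what one of those VM configs roughly looked like on the Proxmox side (the VM ID, storage names, and sizes are all made up):

```bash
qm config 105
# scsi0: clank-ssd:vm-105-disk-0,size=20G    <- Debian boot disk, cloned from a shared base image,
#                                               living on the SSD that later died
# scsi1: data-pool:vm-105-disk-1,size=100G   <- the VM's actual data, on separate storage
```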
After a brief panic (actually quite brief), I checked what mattered first: backups. And yes, the important things (code in Gitea, family data) were all backed up and available. I should’ve tested my contingencies better, but at least monitoring worked.
I restored the most important services on one of my old workstations that I use for development.
I did have backups of the VMs... but they were backed up to the same extra drive that had failed.
That was dumb.
However, I had a theory. I could replace the missing boot disks with new ones and reattach them to the existing VM data disks. Basically, give the dog its bone back.
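In Proxmox terms, that repair is roughly this kind of dance (a sketch only; the VM ID, storage names, and image names are placeholders, and the exact steps varied per VM):

```bash
# Import a fresh Debian image and attach it where the dead boot disk used to be
qm disk import 105 debian-12-genericcloud-amd64.qcow2 data-pool
qm set 105 --scsi0 data-pool:vm-105-disk-2   # whatever index the import landed on
qm set 105 --boot order=scsi0                # the surviving data disk stays where it was

qm start 105
```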
It was not fun, but I managed to restore Matrix, Home Assistant, this blog, Drone, PostgreSQL, and Gitea. Those were the ones I cared about most and that were actually recoverable. The rest had their data living exclusively on the dead disk.
I may or may not share how I fixed it. It’s been a while, and I’d have to reconstruct all the steps. So probably not.
At this point, my Kubernetes cluster was basically borked (if you know, you know). All the data was there, but none of the services worked; most of them depended on secrets from Vault, which was gone.
So I had to start over. Pretty much.
It wasn’t a huge loss, though. All my data lived in Postgres backups, and all configuration was stored GitOps-style in Gitea.
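Getting a database back from one of those dumps is about as boring as it should be; a minimal sketch, with made-up file and database names:

```bash
# Recreate the database and load the most recent dump into it
# (assuming the dump was taken with pg_dump -Fc, i.e. custom format)
createdb gitea
pg_restore --dbname=gitea --clean --if-exists gitea-latest.dump
```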
Postmortem
I never fully restored all the VMs, and that’s fine. I could have, but this was also a good opportunity to improve my setup and finally move more things onto highly available compute. It was also a chance to replace components I wasn’t happy with. Basically the eternal cycle of a homelab.
Harbor was one of them. It’s heavy and fragile. Basically, all my Java services had to go. Not because I hate Java, but because they’re often far too resource-intensive for a homelab running on mini PCs. I can’t have services consuming all the RAM and CPU for very little benefit.
Since then, I’ve significantly improved my backup setup. I now use proper mirrored RAID setups on my workstations for both workloads and backups, plus an offsite backup:
- ZFS with zrepl for snapshot replication
- Borgmatic / BorgBackup for offsite backups
- PostgreSQL incremental backups with pgBackRest
Fun fact: as I was building the new backup setup, another of these SSDs failed on me. That’s two out of three of my Samsung EVO SSDs; I don’t think I’ll be buying those again.
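Day to day, that stack mostly means poking a few CLIs. A hedged sketch of what that looks like (job, stanza, and file names below are placeholders, not my actual config):

```bash
# zrepl replicates ZFS snapshots between the workstations; jobs live in its YAML config
zrepl status                        # live view of the replication jobs
zrepl signal wakeup backup-job      # kick off the (hypothetical) "backup-job" right now

# borgmatic wraps BorgBackup for the offsite copies, driven by its own config file
borgmatic create --verbosity 1 --stats   # run the configured backups
borgmatic list                           # list archives in the configured repositories

# pgBackRest handles the PostgreSQL full + incremental backups per stanza
pgbackrest --stanza=main backup --type=incr
pgbackrest --stanza=main info
```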
Everything is monitored. I also replaced five different Grafana services with a single monitoring platform built on OpenTelemetry and SigNoz. It works well, though replacing PromQL with SQL definitely has some growing pains.
In the next post, I’ll probably share how I do compute and Kubernetes from home, and maybe another homelab oops, like the time I nearly lost all my family’s Christmas wishes 😉
I swear I’m a professional. But we all make mistakes sometimes. What matters is learning from them and fixing problems even when they seem impossible. I am also not a millionaire, so for my homelab I have neither the budget nor the time to build fully fault-tolerant services. I try my best, especially for my own software, which I’ve never had problems with, but many other services just aren’t built for high availability; they require a lot of resources, complex setups, or simply an enterprise license. I put the effort in where it is most fun and rewarding, and that is what having a homelab is all about.
Have a great Friday, and I hope to see you in the next post.