Preface

I ran into a lot of trouble figuring out how to write this post. On one hand, I plan to use this post to demonstrate some of the skills and techniques I've accrued over years of homelabbing, especially to recruiters or managers. On the other hand, I want to write faithfully to my usual rambling and colorful writing style for the small group of friends that may read my posts – and to myself, who truly is the primary intended audience of these posts.

In particular, writing this post has made me realize how far I've developed my skills in systems administration, infrastructure design, DevOps, etc. I really wanted to reflect with great sentiment on my long journey: where I started and the arduous paths I took. I wanted to pay respects to everything that led me to where I am today.

After much deliberation, I have decided to reserve such sentiments for a separate post. I want to dedicate this post to being as technical and straight-to-the-point as possible, as if I were writing a project specification or documentation on my systems for future generations to maintain. That being said, I will not stop to explain each and every concept. To do so would be to document my entire history in homelabbing, systems design, networking, etc., which is well outside the scope of this post. I will assume the reader is already familiar with a wide variety of areas so that I can cover as much ground as possible.

In addition, for those unfamiliar with my writing style: sections of text intended as asides – extra context or pure flavor – are annotated with the word Chaff, alluding to the idiom "to separate the wheat from the chaff". This is inspired by Leary and Kristiansen's A Friendly Introduction to Mathematical Logic (2015), where they use the same label in a similar fashion.

Ultimately, I hope to appeal to the recruiters, technical representatives, or perhaps even novice homelabbers who will read this post. Dear reader, I hope to demonstrate – and transfer – much knowledge to you.

Hardware

My primary Proxmox virtualization host is a Dell PowerEdge R730xd: two Intel Xeon Silver 4116 processors at 2.10GHz (24 cores / 48 threads across the two sockets), 256GB of RAM, and ~10TB of logical storage onboard behind a PERC H730P controller in RAID 6, all housed in the 2U chassis the R730xd comes in.

I also have a separate TrueNAS machine with an Intel Xeon E5-2650 processor running at 2.60GHz, 64GB of RAM, and ~40TB of logical storage behind an M5015 RAID card flashed to HBA mode and connected to a JBOD enclosure housing 15 hard drives. This machine additionally has a GTX 1650 for transcoding media and a 4-port 1Gb/s NIC.

Connecting these two machines is an HP ProCurve 2900 switch. While this exceeds the present discussion of hardware, I find nowhere more appropriate than here to mention that both the Proxmox host and the TrueNAS host are network-bonded (Linux bonds) to the switch in LACP (802.3ad link aggregation) mode, for 4Gb/s of aggregate bandwidth – though still limited to 1Gb/s per TCP connection, since LACP hashes each flow onto a single physical link.
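
To make the bonding concrete, here is a rough sketch of what such an 802.3ad bond looks like in iproute2 terms – the NIC names and the ProCurve trunk command are placeholders, not my actual config, and in practice this lives in the hosts' persistent network configuration rather than ad-hoc commands:

```bash
# Placeholder NIC names (eno1..eno4) – adjust to the actual hardware.
ip link add bond0 type bond mode 802.3ad miimon 100 lacp_rate fast \
    xmit_hash_policy layer3+4
for nic in eno1 eno2 eno3 eno4; do
    ip link set "$nic" down          # links must be down before enslaving
    ip link set "$nic" master bond0
done
ip link set bond0 up

# The switch side must present a matching LACP trunk, e.g. on a ProCurve:
#   trunk 1-4 trk1 lacp
```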

TrueNAS

Starting with the simpler TrueNAS host: this machine (mostly) acts as a dumb storage controller for my Proxmox host to store large files, and as cold storage (e.g. filesystem backups of my old machines). Historical factors led me to create my ZFS pool with 2 data vdevs in RAIDZ2 of differing widths: one with 6 physical disks and one with 9. While obviously not optimal, this was set up at a time when I was less familiar with ZFS, and there is now enough data in the pool that re-optimizing it in place would be impractical. I hope to build an SSD storage array in the future and migrate my data off this pool then, so I can rebalance the vdevs.
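
For illustration, a pool like this ends up shaped roughly as follows – the pool and disk names are hypothetical (in reality you'd want /dev/disk/by-id paths):

```bash
# Hypothetical pool/disk names. The original 6-wide RAIDZ2 data vdev:
zpool create tank raidz2 sda sdb sdc sdd sde sdf

# The later, wider 9-disk RAIDZ2 vdev added alongside it:
zpool add tank raidz2 sdg sdh sdi sdj sdk sdl sdm sdn sdo

# Inspect the resulting layout and per-vdev capacity:
zpool status tank
zpool list -v tank
```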

You may be wondering why this TrueNAS host isn't also running Proxmox, so that I could form a cluster with the other Proxmox host. This, too, is a matter of historical contingency rather than a carefully-planned decision. While I knew I could pass the necessary PCI devices through QEMU+KVM so that a TrueNAS VM had direct access to the HBA controller (probably – assuming my IOMMU groups were valid), I assumed TrueNAS would be much heavier than it actually is and would need the full power of the host, given that I had previously run TrueNAS on a severely bottlenecked machine. This turned out to be very incorrect, and I am now unfortunately stuck with a TrueNAS instance that does basically nothing except export storage shares to the network, wasting a ton of resources.

That isn't entirely true, though, because I ended up running another VM on TrueNAS and passing a spare GTX 1650 through to it. Given that my media is stored on this host and that I wasn't making full use of its power, I decided to run my Jellyfin server in a VM under TrueNAS, both for lower latency to the storage (well, kind of – the VM still has to reach the media ZFS dataset on the TrueNAS host through a network share, bleh!) and for easy access to another x16 PCIe slot for transcoding. I've pinned about half of the CPUs to the VM, following the L1 cache topology, to further optimize it.
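
If you want to do similar pinning, the usual starting point is to see how the logical CPUs map onto cores and caches before choosing which half to hand to the VM; a quick sketch:

```bash
# Show which logical CPUs share cores, sockets and caches,
# so pinned vCPUs can be kept on one half of the topology.
lscpu --extended

# With hwloc installed, a fuller view of the cache hierarchy:
lstopo-no-graphics
```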

Given that I was still far from making full use of the TrueNAS host, I also put some databases/storage on it through TrueNAS SCALE's built-in applications: MySQL, Postgres, Redis, and MinIO. These act as general-purpose databases that I use in the other applications I deploy. In fact, this Ghost deployment is backed by a MySQL database on the TrueNAS host.
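
As an illustration of how these get consumed, this is roughly how a Ghost container can be pointed at an external MySQL instance like the one on the TrueNAS host – hostnames, credentials, and ports here are placeholders, not my real values:

```bash
# Placeholder hostnames/credentials. Ghost reads nested config keys from
# environment variables using "__" as the separator.
docker run -d --name ghost -p 8080:2368 \
  -e url=https://blog.example.com \
  -e database__client=mysql \
  -e database__connection__host=truenas.lan \
  -e database__connection__port=3306 \
  -e database__connection__user=ghost \
  -e database__connection__password='change-me' \
  -e database__connection__database=ghost \
  ghost:5
```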

While I could deploy more applications on the TrueNAS host, I've decided to limit it to just these. Deploying applications on the TrueNAS host – especially those not officially supported – feels fairly finicky, and I dislike the fact that these applications are not stored as IaC (infrastructure as code). There are just better options for deployments, which we'll touch on later. I only chose the TrueNAS host for the database deployments because it has direct access to the fast ZFS pool and I was fine with having my databases outside of IaC.

Proxmox

This is where the bulk of my infrastructure lies. Almost all of my configuration can be found in my Gitea repo. To offer something beyond the repo itself, this post summarizes the configuration, describes it at a higher level, and documents the pieces missing from the repo.

I will first describe my infrastructure in broad strokes before explaining how it's managed.

DNS Server – Technitium

Before a large migration carried out a few months ago, I used PiHole as my primary internal DNS server, letting me run split-horizon DNS and use SNI for private, local services. During the migration, I decided to upgrade to a different DNS server, Technitium, which offers much more powerful DNS management capabilities.

I couldn't quite abandon my PiHole immediately though, because it contained a ton of DNS records that would be cumbersome to import, and I had hard-coded my DNS server IP in multiple places – I had simply never imagined the DNS server IP ever changing. At the time, I also didn't know much about virtual IPs, so that was out of the question. If I shut down the PiHole, I didn't know which services would go down.

Instead, I created my new Technitium server, set the PiHole as a forwarder on it to keep my old records resolvable, and swapped the PiHole's IP for the Technitium's everywhere I could find it.
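
A quick way to sanity-check an arrangement like this – the hostnames and IPs below are made up – is to query the Technitium server directly, both for a new record and for a legacy one that only the PiHole knows about:

```bash
# Hypothetical names/addresses.
# A record defined directly on Technitium:
dig +short gitea.home.example.com @10.0.0.53

# A legacy record that only exists on the PiHole; Technitium should still
# answer it by forwarding the query to the PiHole:
dig +short old-service.home.example.com @10.0.0.53
```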

Even so, I'm too paranoid to fully decommission the PiHole, and I don't want to reconfigure all the blocklists, so it still sits alive to this day as a companion to the Technitium DNS server. The DNS server(s) remain a fundamental pillar of my internal networking stack.

Jumper

I have a VM running standalone Docker that handles basic workloads: workloads that require fast, persistent storage – particularly high random R/W, like databases – or workloads that other infrastructure softly depends on, like the Gitea server I use for GitOps. I call it Jumper because it acts to "jumpstart" the rest of my infrastructure without the rest of my infrastructure strictly relying on it to run. Apart from the Gitea server, I run Portainer for GitOps, and I run my media stack on Jumper because the SQLite databases some of the media applications use are best bound to locally-attached disk rather than a remote share.

Stingray

Stingray is the name I give to my Docker swarm, which has a topology of 3 manager nodes and 3 worker nodes.
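
Bootstrapping a swarm of this shape is pleasantly boring; a sketch with a made-up manager address:

```bash
# On the first manager (placeholder address):
docker swarm init --advertise-addr 10.0.0.21

# Print the join commands for the remaining managers and the workers:
docker swarm join-token manager
docker swarm join-token worker

# Run the printed "docker swarm join --token ..." command on each node,
# then confirm the 3-manager / 3-worker topology from any manager:
docker node ls
```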

Chaff: Why Stingray?

I don't have a particular reason for this naming, unlike my reasons for Jumper. Really, I just wanted to avoid naming things exactly for what they are because I've found such a naming scheme to be surprisingly confusing. For example, my previous Docker swarm was just named Swarm, with VMs in the swarm being named swarm-master-1, swarm-slave-1, etc. The problem arises if I want to introduce another Docker swarm. What would the naming convention be? Something like swarm-2-<role>-<n>? I find it harder to remember what Swarm-2 does, as opposed to just Swarm or Swarm-1. Instead, I find it much easier to remember what things do when I give them proper names, like Jeremy, or Joel. Sure, someone new to these resources may find the naming scheme confusing at first, and wouldn't be able to tell what these resources are, but I find this tradeoff more than worth it.

Stingray is in a bit of a strange spot. It's doomed to its fate as a Docker swarm: an awkward middle ground between a single-node Docker instance and a full-blown Kubernetes cluster. Currently, Stingray runs workloads that don't require fast, persistent storage. Notably, it runs phpMyAdmin and pgAdmin instances as UIs for my databases, and Infisical as a secrets manager replacing HashiCorp Vault. It does run some other workloads like Portainer, Traefik, TrueCommand, and small apps I write, but nothing really interesting.

Chaff: Why is Stingray so useless?

I originally wanted Stingray to act as my Jumper: a lightweight resource that could help jumpstart the rest of my infrastructure as well as run general workloads that aren't suitable for Kubernetes, like my media stack. Unfortunately, Stingray became heavier than I thought, and Docker swarm still has no great CSI (container storage interface), even after all these years, making it difficult to justify putting persistent, IO-heavy workloads on it. Yes, I've tried GlusterFS. Yes, I've tried storage plugins. For different reasons, these solutions are simply not as convenient as running the workload on the single-node Docker instance or using Kubernetes' CSI to allocate storage.

I believe the fact that Stingray is in such an awkward spot really speaks to the state of Docker swarm as a feature. It seems like not many people want to keep improving Docker swarm when Kubernetes exists as a better large-scale solution and when a single-node Docker instance exists as a better small-scale solution, making Docker swarm forever doomed to be left in the dust by both parties.

Moirai

Moirai (pronounced "mee-reh") is the name given to the group of nodes used in my Pterodactyl cluster. The Pterodactyl cluster could have an entire blog post dedicated to it, so I won't go into too much detail here. Put simply, I have a Pterodactyl panel container running on Jumper that interfaces with agents installed on the Moirai nodes to manage game servers like Minecraft, Don't Starve Together, Project Zomboid, etc. The Pterodactyl panel allows me to securely give some of my computing resources to friends who want to manage game servers. If you're curious how these Moirai nodes are made accessible from the internet, see my other post.

Dolo

Dolo is my premier resource: a (currently) 9-node k3s cluster, including a 3-node HA control plane. It runs the flannel CNI, kube-vip in ARP mode for the HA control-plane VIP, and MetalLB in L2 mode for LoadBalancer support. All of this is made easily automatable by Techno Tim's k3s Ansible playbook.
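
For flavor, a minimal MetalLB L2 setup of the kind described above looks roughly like this – the pool name and address range are placeholders, not my actual allocation:

```bash
# Placeholder names and address range.
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.200-10.0.0.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
EOF
```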

Dolo runs all of my other workloads, including, but not limited to:

Additionally, Dolo serves as the main entrypoint to all of my HTTP services from the internet. All ingress HTTP traffic passes through Traefik via the Cloudflare tunnel and ends up logged in Loki, making it easy to audit access. The Cloudflare tunnel (and other tunnels) has allowed me to avoid opening any ports on my router and to very strictly and explicitly control and audit access to my home network. Read about how I expose non-HTTP services in my other post.
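
Conceptually, the tunnel side is just a routing table from public hostnames to the Traefik entrypoint. A generic sketch, rather than my actual setup – the tunnel name, hostnames, and Traefik address are all placeholders:

```bash
# Placeholder tunnel/hostnames. Everything matching the wildcard is handed
# to Traefik; anything else falls through to a 404.
cat <<'EOF' > ~/.cloudflared/config.yml
tunnel: homelab
credentials-file: /home/user/.cloudflared/homelab.json
ingress:
  - hostname: "*.example.com"
    service: https://traefik.dolo.lan:443
    originRequest:
      noTLSVerify: true
  - service: http_status:404
EOF
cloudflared tunnel run homelab
```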

While I could tirelessly expound on all of these services and their configurations, it would be easier if I just showed them to you. Fortunately, I can do just that, now that I've adopted a GitOps workflow for my homelab. So, instead of going through all of the configuration, I'll leave you to check out the Git repo yourself, while we talk about the structure of the repo as a whole.

Git

Git is the primary method by which I manage my configuration. This gives me a centralized location to reason about all my infrastructure as well as a way to reproduce it, should anything happen to it.

Let us dissect the anatomy of the Git repo to gain insight into how this is done.

Terraform/OpenTofu

The tf directory is where I keep my Terraform (OpenTofu) configuration. That's right – Proxmox can be managed through Terraform: I now handle the lifecycle of all of my VMs through Terraform configuration. The details of how to get this set up are provided in the README and Andreas' post, and I shall not repeat them here. I created Terraform modules for my architecture components, notably the clusterized components like the Docker swarm and the Kubernetes cluster. This allows me to scale my infrastructure up and down easily – by simply changing a number and running a script – and even create test clusters if I so choose.
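
The day-to-day loop is short; a sketch with a made-up variable name:

```bash
# Placeholder variable name – scaling a cluster module up by one node
# and reviewing the plan before applying it.
cd tf/
tofu init
tofu plan -var 'k3s_worker_count=6' -out=scale-up.plan
tofu apply scale-up.plan
```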

Chaff: GitOps?

If you are asking whether the Terraform configuration is applied automatically, the answer is no.

Maybe this is violating some GitOps principle, but like any idea, we should take what we find useful from it. In my case, I find that the most valuable part of GitOps is the versioned, centralized, and declarative management approach to infrastructure, not necessarily the automation part.

I prefer to leave vital configuration like my Terraform infrastructure outside of automation, and since I'm the only one working on this, state management isn't a problem.

Chaff: OpenTofu?

OpenTofu is a FOSS fork of Terraform maintained by the Linux Foundation. You can read OpenTofu's FAQ for why they forked Terraform. For the end user, OpenTofu provides a much less restrictive license while doing a great job of remaining compatible with Terraform. As an avid FOSS enjoyer, I found the move to OpenTofu obvious.

That being said, since Terraform is still the more "official" name, I'll refer to all the work I do with OpenTofu as "Terraform". The reader should keep in mind that, in actuality, I have completely replaced Terraform with OpenTofu.

Ansible

While Terraform does a great job of managing the lifecycle of VMs, I use Ansible – kept in my ansible directory – to provision them. Connecting the two is made incredibly easy by Terraform's Ansible provider together with Ansible's Terraform inventory plugin. Simply put, I define a VM resource and add it to an Ansible group via the Ansible provider in Terraform, then use the Terraform inventory plugin to dynamically generate an inventory from the Terraform resources.
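
The Ansible side of that glue is small; a sketch of an inventory source for the cloud.terraform inventory plugin, with placeholder paths, group, and playbook names:

```bash
# Placeholder paths/group names.
cat <<'EOF' > ansible/inventory/terraform_provider.yml
plugin: cloud.terraform.terraform_provider
project_path: ../tf
EOF

# Check that the Terraform-defined hosts and groups come through:
ansible-inventory -i ansible/inventory/terraform_provider.yml --graph

# Then target a group that was assigned from Terraform:
ansible-playbook -i ansible/inventory/terraform_provider.yml playbooks/k3s.yml --limit k3s_cluster
```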

Some of the Ansible roles I employ set up the k3s cluster and the Docker swarm. I try to keep Ansible automation to a minimum, preferring a GitOps-based approach wherever I can, so most of my Ansible playbooks exist only for the initial provisioning of VMs.

DNS Servers

The dns directory is really simple. I manually version control the records by saving them to the Git repo and committing whenever I make a change on the Technitium server.
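
The loop is as unglamorous as it sounds – export, diff, commit (the zone and file names here are placeholders):

```bash
# Placeholder zone/file names – after exporting the zone from Technitium.
cp ~/Downloads/home.example.com.zone dns/home.example.com.zone
git diff dns/
git add dns/
git commit -m "dns: record changes for new service"
git push
```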

Chaff: bUt thAt's NoT VerY gITOps Of yOU!

I thought about using a Terraform provider to apply DNS configuration, writing scripts to automatically synchronize DNS configuration to and from Technitium, or switching to something like BIND9, which works with zone files directly and would make this easier, but I ultimately decided against all of them. I definitely could not use a Terraform provider to apply DNS configuration from the same Terraform project, because I'd end up creating a circular dependency, and scripts to synchronize the DNS configuration seemed like overkill.

Docker

The docker directory in my repository simply contains a collection of Docker Compose files (for the standalone Docker instance) and Docker stack files (for the Docker swarm). Portainer uses these manifests to keep my configuration up-to-date, following a GitOps approach. Not all of the manifests are in use, but I leave them in place in case I ever want to bring them back up in Portainer.

My current workflow for deploying applications is admittedly a bit awkward (a rough shell transcript follows the list):

  1. Set my Docker context to the standalone Docker instance or the Docker swarm
  2. Write a compose file, splitting configuration options out into a .env file
  3. Deploy the compose file and make modifications as needed
  4. Once I see that it's working, destroy the application
  5. Commit and push
  6. In Portainer, re-deploy the application through Git
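
As shell commands, that loop looks roughly like this – the context, file, and stack names are placeholders:

```bash
# Placeholder context/stack names.
docker context use jumper              # or the swarm's context
docker compose --env-file .env up -d   # "docker stack deploy -c ..." on the swarm
# ...iterate on the compose file until everything is healthy...
docker compose down                    # tear down the hand-deployed copy
git add compose.yaml .env.example
git commit -m "add new stack" && git push
# Finally, re-deploy the stack from Portainer, pointed at the Git repo.
```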

Chaff: GitOps Woes

If I went with the "pure" GitOps approach – committing a potentially broken initial compose file and iterating on fixes over subsequent commits – I would be left with a really ugly commit history unless I did some squashing.

I've read from some sources that the solution is to thoroughly test the configuration and use things like pre-commit hooks or some other validation before committing, but the fundamental problem remains: I can't actually know whether a configuration works until I try applying it.

The way I approach this currently is to iterate on the configuration (perhaps in a different environment initially) and deploy it outside of GitOps to ensure it works before committing it and letting it be managed by GitOps. In the case of Portainer, there is the additional step of first deleting the hand-deployed stack, because Portainer can't adopt stacks deployed outside of Portainer.

k8s

Last, but certainly not least, we have the k8s directory, where I keep my Kubernetes manifests. As mentioned earlier, I use FluxCD as the GitOps tool for my Kubernetes cluster. I chose FluxCD over another GitOps tool like ArgoCD frankly because it was a lot easier to set up and learn, and it seemed a lot more lightweight than ArgoCD.
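
Getting Flux hooked up to a self-hosted Gitea is a one-time bootstrap; a sketch with a placeholder repo URL, path, and deploy key:

```bash
# Placeholder repo URL, path and deploy key.
flux check --pre
flux bootstrap git \
  --url=ssh://git@gitea.home.example.com/homelab/infra.git \
  --branch=main \
  --path=k8s/clusters/dolo \
  --private-key-file="$HOME/.ssh/flux_deploy_key"
```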

The workflow I employ to manage my cluster is now as follows (a few commands for watching it happen are sketched after the list):

  1. Test out changes locally and possibly apply some manifests to ensure they work
  2. Make sure these manifests are referenced by FluxCD
  3. Commit and push
  4. FluxCD receives a webhook notification, sends out alerts to notification channels, and starts reconciling the differences
  5. Manifests are now version-controlled
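
Watching (or nudging) that reconciliation from the CLI looks like this – the Kustomization name here is Flux's default flux-system one, which may differ from my actual layout:

```bash
# Check what Flux is tracking and force a sync without waiting for the
# webhook or the poll interval.
flux get sources git
flux get kustomizations --watch
flux reconcile source git flux-system
flux reconcile kustomization flux-system --with-source
```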

Wrap-up

Admittedly, there is still much detail of my homelab left out of the tight confines of this post. I haven't talked about my snapshotting and backup strategy, ZFS dataset layout, subnet configuration, etc. Unfortunately, covering the ground fully is impossible in a single blog post, in part due to the legitimate complexity of the years-old homelab, and in part due to the unrelenting passion of the homelabber, possessing them to proudly overindulge in their lengthy explanations. Blame them not though, for how could one similarly blame a child when they present a realized product of their imagination, excitedly babbling senselessly about it? But I digress.

I hope to have presented a small taste of my homelabbing experience: the technologies I've come into contact with, some skills I employ, etc. As I have incessantly reminded the reader throughout this post, much has been left out – certainly all the pain and suffering of learning and debugging, despite that process being arguably the most valuable. If you're curious about certain aspects of the homelab – whether you're a representative seeking to ascertain my ability or an aspiring homelabber looking to understand it for yourself – I encourage you to ask.

Happy hacking!
