Self-hosting apps on Kubernetes with OpenTofu


Like many others, I first began self-hosting on Proxmox using Community Scripts. These scripts were very easy to use: paste in a curl-bash one-liner and it automatically sets up an app for you, including the underlying infrastructure and the software deployment. But over time, some annoyances surfaced:

  • It's difficult to audit the scripts yourself. While they are open-source, the contents of the scripts aren't laid out in front of you immediately. And every time we need to update, if we want to be vigilant, we have to look over the script again in addition to vetting the app itself. With the rise of supply-chain attacks, one must be more careful when handling updates, so this is a valid issue if we want to err on the side of caution.
  • There is a lack of declarative management and a lack of abstraction between the infrastructure and app layers. The app-hosting LXCs are presented as is, and we can do whatever we want within the containers. But this makes it harder to track the current state of the app deployment, because manual changes inside a container are not recorded anywhere.

For these reasons, I decided to move from the simple community script/LXC approach to using IaC (Infrastructure as Code) with Kubernetes. While it was quite complex to set up at first, it eventually made managing and maintaining the app deployments many times easier. By solving the state management problem, it even unlocked the possibility of using AI coding agents to manage the whole deployment.

This post gives a high-level overview of my current Proxmox infrastructure setup using Talos Linux to manage Kubernetes and Kustomize & Helm templates for deploying apps.

Infra: Talos Linux and Terraform/OpenTofu templates

I first discovered Talos Linux during my research on how to provision and deploy Kubernetes workers and control plane nodes.

Talos is an immutable Linux distro by Sidero Labs built for the sole purpose of running Kubernetes. As such, by deploying Talos, all of the work of installing Kubernetes has already been done for you. There's also a CLI tool, talosctl, for managing Talos. We can use this to easily upgrade or downgrade the OS and add system extensions to help with virtualization or GPU hardware, and in my experience the process pretty much never breaks.

But what makes Talos even more useful is that it's very easy to provision using OpenTofu. Sidero provides a Terraform/OpenTofu provider for this. To set this up, I created a basic Ubuntu LXC for dev purposes on Proxmox and installed OpenTofu and the Kubernetes tooling on it. This is generally a good idea if you want to work on the project from multiple computers, since the Tofu state files then live in one place and don't need to be copied around. I then connected it to the Proxmox server using the bpg/proxmox provider.

To deploy a Kubernetes cluster proper, we need both a worker and a control plane node, so two Talos Linux VMs. I created an OpenTofu template declaring the configuration of the worker and the control plane, which provisions both VMs on the Proxmox host. For example, the control plane part of the template looks something like this:

data "talos_machine_configuration" "controlplane" {
  cluster_name     = var.talos_cluster_name
  cluster_endpoint = local.talos_cluster_endpoint
  machine_type     = "controlplane"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
  talos_version    = var.talos_version
}

data "talos_client_configuration" "this" {
  cluster_name         = var.talos_cluster_name
  client_configuration = talos_machine_secrets.this.client_configuration
  nodes                = [local.controlplane_node_ip]
  endpoints            = [local.talos_cluster_endpoint]
}

resource "talos_machine_configuration_apply" "this" {
  depends_on = [proxmox_virtual_environment_vm.talos]

  node                        = local.controlplane_node_ip
  endpoint                    = local.controlplane_node_ip
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  config_patches = [
    yamlencode({
      machine = {
        install = {
          disk = var.talos_install_disk
        }
      }
    })
  ]
}

With tofu apply and some minutes of waiting, the two VMs spun up, and I was able to verify their status with talosctl stats and confirm that they showed up in PVE. It is important to let OpenTofu generate the Talos credentials for you and output a talosconfig file so we can manage the Talos VMs later.

Apps: Kustomize and Helm templates

Now that the basic Kubernetes cluster is healthy, we can deploy some apps on it.

The simplest way to do this is to create some Kubernetes YAML templates and apply them. A few things to note, however:

  • It's very important to have a dedicated namespace for each app, so the resources and scope of an app can be easily determined.
  • Since a lot of the apps are going to share a very similar software infrastructure (e.g. database), we can factor out the common features into separate deployments.
  • We need a dedicated reverse proxy, so we can securely connect to the apps from the outside.
  • We also need to mount a disk for persistent storage.
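As a rough illustration of the per-app namespace point, each app can get its own directory containing a Namespace manifest and a kustomization.yaml that scopes every resource to that namespace. This is only a sketch: the file names and the jellyfin example are assumptions, not my exact layout, and the two documents below would live in two separate files.

```yaml
# namespace.yaml (hypothetical file name): one namespace per app
apiVersion: v1
kind: Namespace
metadata:
  name: jellyfin
---
# kustomization.yaml: lives next to namespace.yaml in the app's directory
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Every resource listed below is placed into this namespace
namespace: jellyfin
resources:
  - namespace.yaml
  - deployment.yaml   # hypothetical app manifests
  - service.yaml
```

With this layout, listing everything that belongs to an app is a matter of querying a single namespace.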

Traefik, Cert-manager

First we need to set up an ingress controller or reverse proxy for all the apps. I found that Traefik is a pretty good tool for this requirement.

Using its Helm chart along with additional Kustomize templates, I created a Traefik deployment and had it act as the main ingress controller for the cluster.

helmCharts:
  - name: traefik
    repo: https://traefik.github.io/charts
    releaseName: traefik
    namespace: traefik
    valuesFile: ../../helm/values/traefik.yaml

All inbound connections go through Traefik, and based on the requested hostname it routes each connection to one of the apps. This is quite convenient since we only need one such reverse proxy for all the apps. The DNS configuration is simple as well: just add a wildcard DNS record pointing to the Talos worker node IP.
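To make the routing concrete, here is a minimal sketch of a standard Kubernetes Ingress that hands one hostname to one app. The hostname, the Service name and the ingress class name traefik are assumptions for illustration (the class name depends on how the chart is configured); 8096 is Jellyfin's default HTTP port.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  # Assumed IngressClass name; check `kubectl get ingressclass` on your cluster
  ingressClassName: traefik
  rules:
    - host: jellyfin.example.com   # hypothetical name covered by the wildcard record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: jellyfin     # hypothetical Service in the same namespace
                port:
                  number: 8096
```

Each app gets its own Ingress like this in its own namespace, while Traefik itself stays a single shared deployment.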

Of course, we also need to set up SSL/TLS so the app connections are secure. Again there is a good tool for it: cert-manager, along with its Helm chart.

helmCharts:
  - name: cert-manager
    repo: https://charts.jetstack.io
    releaseName: cert-manager
    namespace: cert-manager
    valuesFile: ../../helm/values/cert-manager.yaml

Combined with Traefik, this gave me a complete reverse proxy setup with SSL, and we can now connect to the apps over HTTPS via their URLs.
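For reference, cert-manager issues certificates through an Issuer or ClusterIssuer resource. The following is a minimal sketch assuming Let's Encrypt with the HTTP-01 challenge solved through Traefik; the issuer name, email address and secret name are placeholders, and the source setup may well use a different issuer or challenge type.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt              # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com     # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-account-key   # Secret that will hold the ACME account key
    solvers:
      - http01:
          ingress:
            # Assumed IngressClass name of the Traefik deployment
            ingressClassName: traefik
```

An Ingress can then request a certificate by adding the annotation cert-manager.io/cluster-issuer with the issuer's name, together with a tls section. Note that HTTP-01 requires the hostname to be reachable from the internet on port 80; for a purely internal setup or a wildcard certificate, a DNS-01 solver would be needed instead.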

App containers

My core app stacks consist of Immich, Jellyfin and Matrix (Synapse). Some of them have official Helm charts which can be conveniently used, but for those that don't, I chose to write my own Kustomize templates for them.

Immich has its own Helm chart. We can use this to generate a workloads.yaml file, which includes the full Immich microservice architecture with a backend, machine learning and Valkey service. But for most apps with a Helm chart, an explicit generation step is not required: we can simply use kubectl kustomize --enable-helm to render the Helm chart into Kubernetes manifests automatically. This downloads the charts and renders the manifests on the fly, without having to commit the generated output to the repository.

For apps that don't have an official Helm chart, such as Synapse, I simply wrote plain Kustomize YAML, rendered it with kubectl kustomize and applied the output with kubectl apply.

CloudNativePG

The apps also require persistent content to be stored in databases. While we can very easily have each app set up its own database storage, it's better to have a centralized operator to manage all of this.

CloudNativePG adds a way for apps to provision databases as a resource. This greatly simplifies database configuration: apps can now declare their database usage with the Cluster resource.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: synapse-media
  namespace: matrix
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  volumeName: synapse-media
  resources:
    requests:
      storage: 20Gi
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: synapse-database
  namespace: matrix
spec:
  instances: 1
  imageName: ghcr.io/cloudnative-pg/postgresql:16.8
  storage:
    size: 10Gi
    pvcTemplate:
      accessModes:
        - ReadWriteOnce
      storageClassName: ""
      volumeName: synapse-database-data
      resources:
        requests:
          storage: 10Gi
  bootstrap:
    initdb:
      database: synapse
      owner: synapse
      secret:
        name: synapse-database-app
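The Cluster above references a Secret named synapse-database-app in bootstrap.initdb.secret. CloudNativePG expects this to be a basic-auth Secret holding the owner's credentials, so a minimal sketch could look like the following. The password is a placeholder, and in a real repository it should come from a secret management tool rather than plain YAML.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: synapse-database-app
  namespace: matrix
type: kubernetes.io/basic-auth
stringData:
  # Must match the `owner` declared in bootstrap.initdb
  username: synapse
  # Placeholder only; do not commit real credentials
  password: change-me
```

The operator also creates a read-write Service named after the cluster with an -rw suffix (here synapse-database-rw), which is the host the app would connect to.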

Persistent storage

In addition to databases, apps like Jellyfin need to attach a dedicated persistent media library.

If I had a separate NAS server, the best option would be something like setting up an NFS provisioner in Kubernetes. Since my current storage setup is simply some HDDs mounted on the Proxmox host, the more practical solution was to pass the mounted directories into the Talos VM. Another alternative would have been to pass in the SATA controller for the drives; this would give better performance but is harder to set up.

I configured the directory passthrough on the Proxmox host, allowing the Talos worker VM to access the drives. Then, I configured a local path provisioner for Kubernetes backed by the drive directory. This let me point a PV at the media directory, which enabled Jellyfin to access the media files.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jellyfin-media
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  volumeMode: Filesystem
  local:
    path: /var/mnt/media/jellyfin
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - talos-worker-01
  claimRef:
    namespace: jellyfin
    name: jellyfin-media
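The claimRef in this PV reserves it for a claim named jellyfin-media in the jellyfin namespace. A sketch of the matching PersistentVolumeClaim, modeled on the Synapse claim shown earlier (the requested size is assumed to equal the PV capacity):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
  namespace: jellyfin
spec:
  accessModes:
    - ReadWriteOnce
  # An empty class disables dynamic provisioning so the claim binds statically
  storageClassName: ""
  # Bind to the pre-created PV by name
  volumeName: jellyfin-media
  resources:
    requests:
      storage: 100Gi
```

The Jellyfin pod then mounts this claim as a volume, and the PV's nodeAffinity ensures the pod is scheduled on the worker that actually has the directory.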

Nvidia GPU Passthrough

Immich and Jellyfin benefit from a GPU for hardware-accelerated video codecs and ML workloads. My server has an Nvidia GPU, but Talos does not include Nvidia drivers out of the box; it requires Nvidia system extensions. These can be selected in the Talos Linux Image Factory by adding the Nvidia container toolkit and the Nvidia kernel driver. I then upgraded the Talos worker VM to the resulting image using talosctl upgrade --image <url>, which essentially installed the Nvidia drivers.
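Installing the extensions is usually not enough on its own: as far as I know from the Talos documentation for the proprietary driver, the kernel modules also need to be loaded through a machine config patch on the worker. The following is a sketch of such a patch under that assumption; the module list may differ between Talos and driver versions, so treat it as a starting point.

```yaml
# gpu-worker-patch.yaml (hypothetical file name)
machine:
  kernel:
    modules:
      # Kernel modules shipped by the Nvidia driver extension
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    # Setting recommended alongside the Nvidia modules in the Talos docs
    net.core.bpf_jit_harden: 1
```

A patch like this can be applied to the worker with talosctl patch mc --patch @gpu-worker-patch.yaml, or added to the config_patches list of the worker's OpenTofu resource.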

Besides configuring the worker, we also need to set up GPU passthrough on the Proxmox host side. This is a common process done in many Proxmox VM setups. After this, talosctl get devices --nodes <worker_ip> listed the Nvidia GPU inside the VM:

192.168.0.100   hardware    PCIDevice   0000:01:00.0   1         Display controller         VGA compatible controller   NVIDIA Corporation   GA106M [GeForce RTX 3060 Mobile / Max-Q]
192.168.0.100   hardware    PCIDevice   0000:01:00.1   1         Multimedia controller      Audio device                NVIDIA Corporation   GA106 High Definition Audio Controller

For multi-node Kubernetes clusters, to distinguish deployments that require a GPU from those that don't, we can define a runtime class for Nvidia.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Then, a service that requires an Nvidia GPU sets the runtime class in its pod spec and requests the GPU in its container resources:

runtimeClassName: nvidia

resources:
  limits:
    nvidia.com/gpu: 1
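To show where these two fragments sit in a full manifest, here is a minimal Deployment sketch. The names, labels and image are assumptions for illustration, and the nvidia.com/gpu resource is only available once the NVIDIA device plugin is running in the cluster, which is not covered in this post.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jellyfin
  template:
    metadata:
      labels:
        app: jellyfin
    spec:
      # Pod-level: run the containers with the Nvidia runtime handler
      runtimeClassName: nvidia
      containers:
        - name: jellyfin
          image: jellyfin/jellyfin   # assumed image; pin a version in practice
          resources:
            limits:
              # Container-level: request one GPU from the device plugin
              nvidia.com/gpu: 1
```

Since the GPU limit is a scheduling constraint, the pod can only land on a node that advertises at least one free GPU.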

Results

By provisioning Talos VMs with OpenTofu and managing the apps with Kustomize templates and Helm charts, along with several services to simplify the process, I successfully migrated all my app deployments to Kubernetes. It is now much easier to understand what the infrastructure looks like by simply reading the various levels of IaC templates, and having Kubernetes makes the deployment more robust and scalable.

Beyond that, IaC templates make it feasible to use coding agents to manage the deployment. While it's certainly not a good idea to use them for production, if you have a good review and backup strategy, coding agents can be used to quickly write templates for new apps. For example, I used OpenCode with GPT-5.5 to create the Kubernetes templates for Immich, and it was able to do so quickly and accurately with the help of linting tools. After deployment, the app worked the same as before, and the agent even managed to perform the database import successfully without intervention.

In the future, the templates can be expanded to support multiple devices in a cluster. The fault-tolerant nature of Kubernetes makes failures rare, and the ones that do occur are fairly straightforward to debug with kubectl. This is why I encourage everyone to learn how to use IaC tools like Terraform/OpenTofu and set up a Kubernetes cluster, even for self-hosted deployments.