Building Out the Tinkerbell CI/CD Infrastructure


To build out the testing infrastructure for Tinkerbell, I reached for three of my favorite projects: Pulumi, SaltStack, and Teleport.

We’re at a point in our technological journey where, when we say “CI/CD test infrastructure”, most of us probably think “containers”. However, not all projects are built the same, and it’s not uncommon for more “legacy”-style CI/CD pipelines to still exist. While Tinkerbell is shiny, modern, and awesome, its tests require some unique considerations, like DHCP, VMs, and some weird networking setup beyond my naive understanding.

So, of course, I was the best person to build out the infrastructure 😅.

Pulumi, SaltStack, and Teleport Walk into a Bar

We’re big fans of Kubernetes at Equinix Metal, and we spend a lot of time working on the project and its subprojects — enough to know when not to use Kubernetes.

Kubernetes may be something we use for the Tinkerbell project in the future, but our requirements today, and likely for the next 12+ months, don’t call for it. So if you came here for some juicy Kubernetes usage, I’m sorry to disappoint!

To build out our Tinkerbell testing infrastructure, I instead reached for three of my favorite projects.

Pulumi

We can’t consider anything ready for production if it is not automated. When it comes to provisioning hardware, Pulumi and Terraform are both great tools, no matter how big or small the project. I opted for Pulumi, as I have for most projects over the last two years, because it lets me use real programming languages to provision my infrastructure rather than a domain-specific language (DSL).

Normally, I’d use TypeScript with Pulumi: JavaScript runtimes like Node.js juggle JSON without batting an eye, and TypeScript’s type system thrown into the mix makes a killer combination. However, I’m not working on the Tinkerbell project alone; I am part of a growing community of people actively involved in and contributing to the project. I want them to feel happy, comfortable, and at home if (or when) they want to contribute to or tweak the CI/CD automation. For those reasons, I called on another super-power of Pulumi: choice. With Pulumi, I can use any of its supported languages, not just TypeScript. My colleagues and fellow contributors work on Tinkerbell in Go, so Go was the natural choice for the automation.

Something really difficult to get right in a project like this is secrets management. Pulumi has built-in support for storing encrypted secrets in the Git repository, so we don’t need to pre-seed our infrastructure with a secrets management system. We can use the default provider, Pulumi for Teams, which Pulumi has generously provided to the Tinkerbell project. Of course, Pulumi doesn’t lock us in here either: we can swap the secrets provider for a cloud KMS such as AWS KMS or GCP KMS, among others.

One last thing about Pulumi: not only does it have its own native providers, it can also wrap and consume Terraform’s providers. This is great for people who want to try Pulumi, as there’s very little risk of missing out on functionality that Terraform gets through its many third-party providers. In fact, we used a couple of wrapped providers on this project.

You can check out our Pulumi automation here, but to give you a sneak peek at how we provision the infrastructure, here’s a quick look at a couple of resources.

Create an Elastic IP Block

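// Reserve a block containing a single IP address for the Salt Master.
elasticIP, err := equinix.NewReservedIpBlock(ctx, "salt-master", &equinix.ReservedIpBlockArgs{
	Facility:  saltMasterConfig.Facility,
	ProjectId: pulumi.String(projectID),
	Quantity:  pulumi.Int(1),
})
if err != nil {
	return err
}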

Create a Device

deviceArgs := equinix.DeviceArgs{
	ProjectId: pulumi.String(projectID),
	Hostname:  pulumi.String(fmt.Sprintf("%s-%s", ctx.Stack(), "salt-master")),
	Plan:      saltMasterConfig.Plan,
	Facilities: pulumi.StringArray{
		saltMasterConfig.Facility,
	},
	OperatingSystem: equinix.OperatingSystemUbuntu2004,
	Tags: pulumi.StringArray{
		pulumi.String("role:salt-master"),
	},
	BillingCycle: equinix.BillingCycleHourly,
	// Render the cloud-init user data once the generated Teleport peer token is known.
	UserData: teleportConfig.PeerToken.Result.ApplyString(func(s string) string {
		bootstrapConfig.teleportPeerToken = s
		return cloudInitConfig(bootstrapConfig)
	}),
}
device, err := equinix.NewDevice(ctx, "salt-master", &deviceArgs)
if err != nil {
	return err
}

Create a DNS Record with NS1

// Point the Teleport domain at the reserved IP address.
_, err = ns1.NewRecord(ctx, "teleport", &ns1.RecordArgs{
	Zone:   pulumi.String(infrastructure.Zone.Zone),
	Domain: pulumi.String(teleportDomain),
	Type:   pulumi.String("A"),
	Answers: ns1.RecordAnswerArray{
		ns1.RecordAnswerArgs{
			Answer: elasticIP.Address,
		},
	},
})
if err != nil {
	return err
}

SaltStack

OK. So we can provision hardware, but how do we get our workloads running on them? Configuration management is extremely common, extremely important, and already extremely well understood.

Picking a tool mostly comes down to personal preference. SaltStack is mine, and while we’ll explore using it for basic configuration management, I also want to share a few features that might move it up your own list and, hopefully, encourage more adoption of this awesome tool.

Chickens and Eggs

Of course, we need to get SaltStack onto our machines before SaltStack can manage them. Equinix Metal supports a few userdata formats that help with this. I’m a fan of cloud-init, and we’ve used it for Tinkerbell to handle the initial bootstrapping of our devices.

Here’s where things get a little interesting. Usually, we’d use a standard YAML file to manage our cloud-init configuration and render it into our userdata. However, because we’re using Go with Pulumi, we can leverage any libraries within the Go ecosystem. The Juju project has a handy little cloud-init abstraction that makes building our cloud-init dynamically through declarative code really simple.

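// Build the user data with Juju's cloudinit package, targeting Ubuntu 20.04 ("focal").
c, err := cloudinit.New("focal")
c.SetSystemUpdate(true) // refresh the package index on first boot
// Add the SaltStack apt repository, pinned to the 3002 release.
c.AddPackageSource(packaging.PackageSource{
	Name: "saltstack",
	URL:  "deb http://repo.saltstack.com/py3/ubuntu/20.04/amd64/3002 focal main",
})
c.AddPackage("salt-master")
c.AddPackage("salt-minion")
// Render the whole configuration as a shell script.
script, err := c.RenderScript()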

Standard Configuration Management

# Install a standard package
certbot:
  pkg.installed

# Install a remote package
teleport-install:
  pkg.installed:
    - sources:
      - teleport: https://get.gravitational.com/teleport_5.1.0_amd64.deb

# Create and manage the Teleport configuration
teleport-config:
  file.managed:
    - name: /etc/teleport.yaml
    - source: salt://{{ slspath }}/files/teleport.yaml
    - template: jinja
    - user: root
    - group: root
    - mode: 644

# Service management: reload Teleport whenever its configuration changes
teleport-service:
  service.running:
    - name: teleport
    - enable: True
    - reload: True
    - watch:
      - file: teleport-config

Delegated Secrets with Pillars

In this example, our Salt Master provides the AWS credentials needed to sync Let’s Encrypt certificates to an S3 bucket. Pulumi stores these credentials encrypted in the Git repository, and we provision them to the Salt Master machine via cloud-init. The GitHub runners don’t have access to them, but we could delegate access by adding the runners to the targets for that Pillar data in the pillar top.sls file. This is a fantastic feature of SaltStack: secret information lives in a single location, with a declarative API for delegating it.

We sync this whenever we rebuild the infrastructure to ensure we don’t run into rate limit problems with the Let’s Encrypt API. You’ll also notice a “watch” requisite on our state; it ties the sync to the teleport-tls-new state, so we resync the certificates whenever that state changes them. Great for keeping an eye on the renew commands!

teleport-s3-sync-up:
  cmd.run:
    - name: aws s3 sync /etc/letsencrypt s3://{{ pillar.aws.bucketName }}/letsencrypt/
    - env:
      - AWS_ACCESS_KEY_ID: {{ pillar.aws.accessKeyID }}
      - AWS_SECRET_ACCESS_KEY: {{ pillar.aws.secretAccessKey }}
      - AWS_DEFAULT_REGION: {{ pillar.aws.bucketLocation }}
    - watch:
      - cmd: teleport-tls-new
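To make the delegation idea concrete, here is a toy Go model of how a pillar top file decides which minions receive which Pillar data. It assumes glob targeting (Salt’s default match type for top files) and uses hypothetical minion IDs and pillar names (`salt-master`, `github-runner-*`, `aws`, `runner`). Go’s `path.Match` stands in for Salt’s glob matcher; this illustrates the concept and is not Salt’s own code.

```go
package main

import "path"

// topEntry models one target in a pillar top.sls, for example:
//
//	base:
//	  'salt-master':
//	    - aws
//
// All IDs and pillar names used with it here are hypothetical.
type topEntry struct {
	target  string   // glob matched against the minion ID
	pillars []string // pillar SLS names rendered for matching minions
}

// pillarsFor returns the pillar SLS names a minion would receive, in
// top-file order, without duplicates. path.Match approximates Salt's glob
// matching; minion IDs normally contain no '/', so '*' behaves the same.
func pillarsFor(top []topEntry, minionID string) ([]string, error) {
	seen := map[string]bool{}
	var out []string
	for _, e := range top {
		ok, err := path.Match(e.target, minionID)
		if err != nil {
			return nil, err // malformed glob in the top data
		}
		if !ok {
			continue
		}
		for _, p := range e.pillars {
			if !seen[p] {
				seen[p] = true
				out = append(out, p)
			}
		}
	}
	return out, nil
}
```

In this model, delegating the AWS credentials to the runners is a one-line change: append `aws` to the `github-runner-*` entry. That mirrors adding `- aws` under the runners’ target in top.sls, while the secret values themselves stay in one place on the Salt Master.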

Event Loop

SaltStack is built around an event bus. Communication between the master and its minions travels as messages over ZeroMQ (Salt’s default transport), and those messages surface as events that can be subscribed to on the master and on each minion. We can both produce and consume events, giving us near-infinite flexibility to trigger and react to nearly any event within our infrastructure. We’re not doing anything with this yet, but I am experimenting with eBPF.

eBPF Fun

There are many eBPF use cases for advanced network protection and monitoring, but I’m going to use a slightly more fun example. Let’s assume that we want to trigger a Salt event any time a DNS lookup happens. The eBPF code for this is provided in the iovisor/bcc examples directory. With just a tiny modification, we can have this program emit Salt events.

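import salt.client

# Caller runs execution modules locally, like salt-call on the minion
caller = salt.client.Caller()

# dnsrec is the dnslib record parsed in the bcc DNS example; convert the
# question names to strings so the event payload can be serialized
domains = [str(q.qname) for q in dnsrec.questions]
ret = caller.cmd("event.send", "eBPF/dns/lookup", {"domains": domains})
if not ret:
    # the event could not be sent; handle the error here
    print("failed to send Salt event for", domains)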

With these events being emitted, we can use Salt’s Reactor system, which maps event tags to reactor SLS files, to perform any arbitrary reaction. We can send a message to Slack, write the event to InfluxDB, or even modify the networking configuration in real time to block access to the IPs of the domains being resolved.

Teleport

Tinkerbell is a CNCF project, which means the Foundation provides guardianship for the project and no one person “owns” it. Knowing that, putting my own SSH keys on the devices doesn’t pass any single-point-of-failure test. I wanted to provide access to the infrastructure, as required, in a secure, auditable, and safe manner.

Teleport calls itself a “unified access plane for infrastructure”. It provides a mechanism for democratizing access to Linux servers, Kubernetes, HTTP applications, and databases. Because Teleport acts as an authenticated proxy, we can configure it to allow access only to members of a certain GitHub team in the Tinkerbell organization. We can even uninstall the SSH daemon from all of our servers and rely exclusively on Teleport for SSH. Every session is audited, and the Tinkerbell team can play back any recorded session. This means we don’t need to distribute keys to our team, we can enforce 2FA, and team members can even collaborate in shared terminal sessions through the Teleport UI.


That’s All Folks

That’s our Tinkerbell CI/CD infrastructure!

We’re still early in our development, but we’ve got a strong, stable, and secure foundation to continue to build it out as our needs change. If you’re interested in contributing to the project, take a look at our introductory videos on YouTube, or join our bi-weekly community calls or Slack channel.