Ansible Does Not Scale

I just got the news that open-source Puppet has been crippled by its parent company, Perforce. It’s a hostile move, meant to squeeze more money out of their open-source project.

It made me think back to when I switched from Puppet to Ansible in 2013, a decision I have never regretted. A lot of people tell me that Ansible doesn’t scale. That’s true, but you might be using Ansible wrong if you expect it to manage 20 nodes or more.

Ansible is a great tool because you can easily define what to do in very simple terms. It’s less complicated than Terraform because it doesn’t maintain any state, and it’s more portable than Puppet because it requires no agent and operates over SSH. That is also its biggest bottleneck: everything has to go over an SSH connection.

Why Ansible is great

When you’re working with Linux you end up running a series of commands, starting some services, and editing some files, and this is exactly what you define in Ansible. Reading an Ansible playbook is like reading a very simple shell script. Each task has a descriptive name and an action to be executed.

- name: install the command
  package:
    name: pngquant
    state: present

- name: run pngquant command to optimize image
  command: pngquant image.png --output optimized.png

- name: start web server
  service:
    name: nginx
    state: started

It doesn’t get more legible than this format, and it’s self-documenting.

This also means you can keep Ansible playbooks in a repo and run them locally on a node, perhaps after cloning the repo with git or unpacking it from an archive. This is how I’ve run CIS Benchmark hardening on air-gapped RHEL systems in the past, and how I maintain my workstation setup in Fedora.
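As a minimal sketch of a locally run playbook: the file name local.yml and the single task are placeholders for illustration, not an actual hardening or workstation setup. The play targets localhost with a local connection, so no SSH is involved.

```yaml
# local.yml (hypothetical file name), meant to be run on the node itself
- name: configure this machine locally
  hosts: localhost
  connection: local   # run tasks directly, without SSH
  become: true        # package installation needs root
  tasks:
    - name: install the command
      package:
        name: pngquant
        state: present
```

You would run it on the node with `ansible-playbook local.yml`. If the node can reach a git server, `ansible-pull -U <repo-url>` can do the clone and the local run in one step; on an air-gapped system, copying the repo over as an archive works just as well.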

Think outside of the Ansible

My Linux server work has evolved a lot during the last 25 years, and the latest revolution for me is containers and container hosts. When everything is defined as a container, it also becomes easier to redefine your classic Linux servers as appliances.

When scaling up to more nodes, I don’t think watching Ansible slowly chug through your hosts from your control machine is any fun. I’d rather deploy the nodes in a different way, namely provisioning each node as an appliance.

The main component for this is containers, along with an immutable OS used as the container host. The OpenShift Agent-based Installer is a good example: it generates an ISO of an immutable OS that you can deploy on bare metal with PXE or USB sticks.

In the past I’ve done it using a CoreOS OVA template in VMware vSphere. All the setup was done on first boot using Ignition, and this is also where I could run Ansible if I wanted to.
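To give an idea of what first-boot setup can look like, here is a minimal sketch that assumes Fedora CoreOS and a Butane config, which is the YAML form that gets converted into Ignition JSON. The unit name and the container image are hypothetical placeholders, and other CoreOS variants or spec versions will need adjustments.

```yaml
# appliance.bu (hypothetical file name): Butane config for Fedora CoreOS
variant: fcos
version: 1.5.0
systemd:
  units:
    - name: myapp.service          # hypothetical unit name
      enabled: true
      contents: |
        [Unit]
        Description=Run the appliance workload as a container
        After=network-online.target
        Wants=network-online.target

        [Service]
        # registry.example.com/myapp is a placeholder image
        ExecStart=/usr/bin/podman run --rm --name myapp registry.example.com/myapp:latest
        Restart=on-failure

        [Install]
        WantedBy=multi-user.target
```

Running `butane --pretty --strict appliance.bu > appliance.ign` produces the Ignition file, which is then handed to the machine at provisioning time (in vSphere, typically through the VM’s guestinfo properties).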

In the future I’ll probably use bootc, as it’s set to replace rpm-ostree altogether.

For small and medium businesses, Ansible over SSH is fine. But when you want to deploy large clusters like Ceph or Kubernetes, you should really look into other options for provisioning rather than trying to do everything with Ansible after installation.

The larger your fleet of servers is, the more likely it is that you have very standardized roles like workers and control servers. That’s two images you have to create and maintain, or, with cloud-init, one image and two configuration files.
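A rough sketch of the one-image, two-configs idea with cloud-init follows. The file names, the /etc/node-role marker file, and the two service names are made up for illustration; the point is only that the same image boots into a different role depending on which user-data file it receives.

```yaml
#cloud-config
# worker.yaml (hypothetical file name): user-data for worker nodes
write_files:
  - path: /etc/node-role
    content: |
      worker
runcmd:
  # worker-agent.service is a hypothetical unit baked into the image
  - [systemctl, enable, --now, worker-agent.service]
```

```yaml
#cloud-config
# control.yaml (hypothetical file name): user-data for control servers
write_files:
  - path: /etc/node-role
    content: |
      control
runcmd:
  # control-plane.service is a hypothetical unit baked into the image
  - [systemctl, enable, --now, control-plane.service]
```

Both units would already be present in the shared image; the user-data only decides which one gets enabled on first boot.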

The disadvantage is that you need a staging environment to test these images in, and you have to build your monitoring tools into them. The advantage is that you don’t even need SSH installed in production anymore.

So the takeaway is this: treat Linux servers more like appliances, and you won’t have to run Ansible on 50 hosts.