Looking to build a high availability Ceph cluster with ease? Ansible Playbooks have your back! Whether you're scaling out storage for a home lab or enterprise setup, automating your Ceph deployment is key to reliability and efficiency. In this guide, I'll walk you through step-by-step how to set up a resilient, high availability Ceph cluster using Ansible Playbooks—so you can focus on your data, not the details. Let's get your cluster up and running like a pro!
- This gist covers basic Ansible playbooks that simplify the repetitive tasks involved in provisioning your Ceph clusters.
- We will be using MicroCeph to build a High Availability (HA) Ceph cluster.
- A bare minimum of 3 Linux-based nodes that support snapd. These can be virtual machines or bare-metal systems that live within a network boundary where they can reach each other.
For example: I am doing this on 3 Orange Pi 3Bs that I have at home. Each has a self-compiled Linux image with kernel support for Ceph and RBD. Please refer to my Medium article for more information.
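If you are rolling your own image, a quick sanity check for Ceph/RBD kernel support looks like the sketch below. It assumes a Debian-style kernel config under /boot (as on Armbian/Ubuntu); adjust the path for your distro.
# Check that the kernel was built with CephFS and RBD support
grep -E 'CONFIG_CEPH_FS|CONFIG_CEPH_LIB|CONFIG_BLK_DEV_RBD' /boot/config-$(uname -r)
# Try loading the rbd module and confirm it is present
sudo modprobe rbd && lsmod | grep rbd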
I do want to point out that these are OPi 3B 8G V1 boards, which are still being sold on Amazon even though they have reached end of sale and been taken off the shelf. Please don't buy them before doing your research. Depending on the base Linux kernel version you compile your image with (for Ceph/RBD support, if you wish to use Armbian), there are kernel and hardware changes that will break either u-boot or the Wi-Fi and Ethernet.
The Orange Pi 3B has issues with both Joshua Riek's Ubuntu images and OPi's official images, where for some reason there is a constant load on the cores even at idle. Losing roughly 25 percent of your compute power to an idle system didn't seem very appealing, so I went the Armbian route, which doesn't have that issue.
Please do not use SBCs for production deployments. Feel free to use an HA Proxmox cluster or a Metal3-based topology on more robust server-grade systems for production deployments.
In this section we are going to apply the following security hardening to our bare-metal nodes.
Refer to this Red Hat documentation on securing SSH for more details. Just create an sshd_config file, copy it over to each node at /etc/ssh/, and then restart the SSH service: sudo systemctl restart ssh. A copy-and-restart sketch follows the sample config below.
- Update the SSH config to allow only a single user, with password login disabled and authorized-key logins only.
- Change the default port for the SSH server to some arbitrary value.
- Don't allow root login.
- Use Protocol 2.
- Set the maximum concurrent sessions to a low value like 2 or 3.
Sample sshd_config. Please make necessary adjustments (like for Port, AllowUsers <ssh-user-name>, etc.) before restarting the ssh service.
# This is the sshd server system-wide configuration file. See
# sshd_config(5) for more information.
# This sshd was compiled with PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
# The strategy used for options in the default sshd_config shipped with
# OpenSSH is to specify options with their default value where
# possible, but leave them commented. Uncommented options override the
# default value.
Include /etc/ssh/sshd_config.d/*.conf
Port 2322
#AddressFamily any
#ListenAddress 0.0.0.0
#ListenAddress ::
#HostKey /etc/ssh/ssh_host_rsa_key
#HostKey /etc/ssh/ssh_host_ecdsa_key
#HostKey /etc/ssh/ssh_host_ed25519_key
# Ciphers and keying
#RekeyLimit default none
# Logging
#SyslogFacility AUTH
#LogLevel INFO
# Authentication:
#LoginGraceTime 2m
PermitRootLogin no
#StrictModes yes
#MaxAuthTries 6
#MaxSessions 10
PubkeyAuthentication yes
# Expect .ssh/authorized_keys2 to be disregarded by default in future.
#AuthorizedKeysFile .ssh/authorized_keys .ssh/authorized_keys2
#AuthorizedPrincipalsFile none
#AuthorizedKeysCommand none
#AuthorizedKeysCommandUser nobody
# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts
#HostbasedAuthentication no
# Change to yes if you don't trust ~/.ssh/known_hosts for
# HostbasedAuthentication
#IgnoreUserKnownHosts no
# Don't read the user's ~/.rhosts and ~/.shosts files
#IgnoreRhosts yes
# To disable tunneled clear text passwords, change to no here!
PasswordAuthentication no
PermitEmptyPasswords no
# Change to yes to enable challenge-response passwords (beware issues with
# some PAM modules and threads)
KbdInteractiveAuthentication no
# Kerberos options
#KerberosAuthentication no
#KerberosOrLocalPasswd yes
#KerberosTicketCleanup yes
#KerberosGetAFSToken no
# GSSAPI options
#GSSAPIAuthentication no
#GSSAPICleanupCredentials yes
#GSSAPIStrictAcceptorCheck yes
#GSSAPIKeyExchange no
# Set this to 'yes' to enable PAM authentication, account processing,
# and session processing. If this is enabled, PAM authentication will
# be allowed through the KbdInteractiveAuthentication and
# PasswordAuthentication. Depending on your PAM configuration,
# PAM authentication via KbdInteractiveAuthentication may bypass
# the setting of "PermitRootLogin yes".
# If you just want the PAM account and session checks to run without
# PAM authentication, then enable this but set PasswordAuthentication
# and KbdInteractiveAuthentication to 'no'.
UsePAM yes
#AllowAgentForwarding yes
#AllowTcpForwarding yes
#GatewayPorts no
X11Forwarding no
MaxSessions 3
#X11DisplayOffset 10
#X11UseLocalhost yes
#PermitTTY yes
PrintMotd no
#PrintLastLog yes
#TCPKeepAlive yes
#PermitUserEnvironment no
#Compression delayed
ClientAliveInterval 60
ClientAliveCountMax 3
#UseDNS no
#PidFile /run/sshd.pid
#MaxStartups 10:30:100
#PermitTunnel no
#ChrootDirectory none
#VersionAddendum none
# no default banner path
#Banner none
# Allow client to pass locale environment variables
AcceptEnv LANG LC_*
# override default of no subsystems
Subsystem sftp /usr/lib/openssh/sftp-server
# Example of overriding settings on a per-user basis
#Match User anoncvs
# X11Forwarding no
# AllowTcpForwarding no
# PermitTTY no
# ForceCommand cvs server
AllowUsers <ssh-user-name>
Protocol 2
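Once you've adjusted the sample above, a minimal sketch for pushing it out to every node and restarting SSH could look like the following. It assumes the file is saved locally as sshd_config, the nodes still listen on the default port 22, and your user can already sudo; the IPs are the ones used in the inventory later in this guide.
# Hypothetical loop - replace the IPs and user with your own
for node in 192.168.9.78 192.168.9.82 192.168.9.81; do
  scp sshd_config <ssh-user-name>@$node:/tmp/sshd_config
  ssh <ssh-user-name>@$node "sudo install -m 644 /tmp/sshd_config /etc/ssh/sshd_config && sudo systemctl restart ssh"
done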
Recommended: add your user <ssh-user-name> to sudoers to allow passwordless sudo. Otherwise you will need to put credentials in the Ansible inventory, which is not very secure.
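A minimal sketch for that, run once on each node (replace the <ssh-user-name> placeholder with your actual user):
# Grants passwordless sudo to your Ansible SSH user
echo '<ssh-user-name> ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/<ssh-user-name>
sudo chmod 0440 /etc/sudoers.d/<ssh-user-name>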
With all that out of the way, we can dive right into the Ansible stuff. The following can be run from any machine that can reach your intended nodes over the network.
We need Python and Ansible installed before we can get started.
# Let's create a conda env with Python and pip in it
conda create --name ansible_ceph_hq python pip
# Activate the conda env
conda activate ansible_ceph_hq
# Install Ansible
pip install ansible
# Let's create a folder
mkdir deploy-ceph && cd deploy-ceph
# Let's create an Inventory file
touch inventory.ini
# Create a directory for playbooks
mkdir playbooks
The contents of your inventory are simple. Refer to my example below and update it with your user:
[ceph_nodes]
hc-opi3b8-1 ansible_host=192.168.9.78 ansible_port=2322
hc-opi3b8-2 ansible_host=192.168.9.82 ansible_port=2322
hc-opi3b8-3 ansible_host=192.168.9.81 ansible_port=2322
[all:vars]
ansible_connection=ssh
ansible_user=<ssh-user-name>
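Before touching any playbooks, it's worth a quick ad-hoc ping from the deploy-ceph directory to confirm Ansible can reach every node over SSH with your key:
# Each node should come back with "ping": "pong"
ansible -i inventory.ini ceph_nodes -m ping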
Let's create a file called microceph_ha_preinstall_checks.yaml inside the playbooks directory. It will install the snapd daemon if it is not already installed, and check for raw drives and their read speeds.
---
- name: Microceph HA setup preinstall checks
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Check if snapd is installed
      package:
        name: snapd
        state: present
      register: snapd_check
    - name: Display message if snapd was installed
      debug:
        msg: "snapd was not installed, now installing."
      when: snapd_check.changed
    - name: Confirm snapd installation
      debug:
        msg: "snapd is already installed."
      when: not snapd_check.changed
    # Without this, on some older Debian kernels you would not have the full microceph-support plugin.
    # We can assume that if this playbook is running on an OS where the snap daemon had to be installed,
    # then we also upgrade it.
    - name: Upgrade snapd to latest version
      command: sudo snap install snapd
      when: snapd_check.changed
    - name: Gather all block devices
      command: lsblk -d -n -o NAME
      register: block_devices
    - name: Set the drive paths
      set_fact:
        drive_paths: "{{ block_devices.stdout_lines | map('regex_replace', '^', '/dev/') | list }}"
    - name: Run hdparm read speed test on specific drives
      command: /sbin/hdparm -t {{ item }}
      register: hdparm_result
      loop: "{{ drive_paths }}"
      # Filter as needed for your devices
      when: item is match('/dev/sd[a-z]') or item is match('/dev/nvme.*')
    - name: Display the hdparm output
      debug:
        var: hdparm_result.results
Run the above from inside the playbooks directory with the following command:
ansible-playbook -i ../inventory.ini microceph_ha_preinstall_checks.yaml
This should produce results like:




As you can see from the snapshots above, snapd has been checked/installed on the nodes, and then you can see the disk read speeds for all your filtered drives.
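If a single node misbehaves, you can rerun the checks against just that host with more verbose output; --limit and -v are standard ansible-playbook flags, and the host name here is just one from the example inventory:
ansible-playbook -i ../inventory.ini microceph_ha_preinstall_checks.yaml --limit hc-opi3b8-2 -v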
Let's create a file called microceph_ha_setup_cluster.yaml
- This will bootstrap one node of our choosing and then form an HA cluster using the rest.
Please update the values of your hosts as needed.
---
- name: Install and hold refresh for microceph
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Ensure snapd is installed
      package:
        name: snapd
        state: present
    - name: Install microceph using snap
      command: sudo snap install microceph
      register: install_microceph
      changed_when: "'microceph' in install_microceph.stdout"
    - name: Hold snap refresh for microceph
      command: sudo snap refresh --hold microceph
      when: install_microceph is changed
      register: hold_microceph
    - name: Display install status
      debug:
        msg: "Microceph installed and refresh held."
      when: install_microceph.changed
- name: Microceph cluster setup on master node
  hosts: hc-opi3b8-1
  become: yes
  tasks:
    - name: Bootstrap the microceph cluster on the first node
      command: sudo microceph cluster bootstrap
    - name: Add second node to the cluster
      command: sudo microceph cluster add hc-opi3b8-2
      register: add_node_2_result
    - name: Add third node to the cluster
      command: sudo microceph cluster add hc-opi3b8-3
      register: add_node_3_result
- name: Join microceph cluster on node-2
  hosts: hc-opi3b8-2
  become: yes
  tasks:
    - name: Join the microceph cluster using the token from master node
      command: sudo microceph cluster join {{ hostvars['hc-opi3b8-1'].add_node_2_result.stdout }}
      when: hostvars['hc-opi3b8-1'].add_node_2_result is defined
- name: Join microceph cluster on node-3
  hosts: hc-opi3b8-3
  become: yes
  tasks:
    - name: Join the microceph cluster using the token from master node
      command: sudo microceph cluster join {{ hostvars['hc-opi3b8-1'].add_node_3_result.stdout }}
      when: hostvars['hc-opi3b8-1'].add_node_3_result is defined
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_setup_cluster.yaml
You will see results like this:


Let's create a file called microceph_ha_add_storage.yaml
- This will add OSDs/storage devices from each node.
Please update the drive names as necessary.
---
- name: Add OSD storage to microceph cluster
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Add disks to microceph as OSDs with wipe option
      # Update as necessary
      command: sudo microceph disk add /dev/sda /dev/nvme0n1 --wipe
    # OPTIONAL - if your primary OS has enough space
    - name: Add loop disks to microceph as OSDs
      # Update as necessary
      command: sudo microceph disk add loop,75G,2
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_add_storage.yaml
You will see results like this:

At this point you should have an HA Ceph cluster with storage added.
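If you'd like a quick manual sanity check before writing the status playbook below, you can SSH into any node and run the following (the ceph command is assumed to be available the same way the check playbook below uses it):
# Cluster membership and services as seen by microceph
sudo microceph status
# Overall Ceph health, OSDs, and capacity
sudo ceph status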
Let's create a file called microceph_ha_check_cluster.yaml
- This will show you the status of the Ceph cluster you just created.
This can be run on any node, so feel free to update the hosts var.
---
- name: Check microceph cluster status and disk list
  # This can be run on any host that has microceph installed
  hosts: hc-opi3b8-1
  become: yes
  tasks:
    - name: Check microceph cluster status
      command: sudo ceph status
      register: ceph_status
      changed_when: false
    - name: Display microceph cluster status
      debug:
        var: ceph_status.stdout
    - name: List microceph OSD disks
      command: sudo microceph disk list
      register: disk_list
      changed_when: false
    - name: Display microceph OSD disk list
      debug:
        var: disk_list.stdout
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_check_cluster.yaml
You will see results like this:

Behold my 9 TB data lake, all set up by running 4 Ansible playbooks. Add as many nodes as you need to your inventory and scale out to hundreds of nodes if you'd like.
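Scaling out is mostly a matter of adding hosts to the inventory and wiring up a matching cluster add/join pair in the setup playbook. A hypothetical fourth node (the host name and IP below are made up) would look like this in inventory.ini:
[ceph_nodes]
hc-opi3b8-1 ansible_host=192.168.9.78 ansible_port=2322
hc-opi3b8-2 ansible_host=192.168.9.82 ansible_port=2322
hc-opi3b8-3 ansible_host=192.168.9.81 ansible_port=2322
# Hypothetical new node - replace with your own host name and IP
hc-opi3b8-4 ansible_host=192.168.9.84 ansible_port=2322
Then add a microceph cluster add / microceph cluster join pair for the new host in microceph_ha_setup_cluster.yaml and rerun the storage playbook against it.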
Let's create a file called microceph_ha_destroy_cluster.yaml
- This will tear down your entire cluster.
---
- name: Destroy microceph HA cluster
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Remove microceph with purge
      command: sudo snap remove microceph --purge
      register: remove_microceph
      changed_when: "'microceph removed' in remove_microceph.stdout or 'microceph removed' in remove_microceph.stderr"
    - name: Confirm microceph removal
      debug:
        msg: "Microceph has been successfully removed from {{ inventory_hostname }}."
      when: remove_microceph is changed
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_destroy_cluster.yaml
You will see results like this:
And there you have it—a streamlined, high availability Ceph cluster deployed and managed with the power of Ansible! By automating the deployment process, you've not only saved time but also ensured consistency and scalability in your storage setup. As your storage needs grow, your Ansible playbooks will make expanding and managing your cluster a breeze. Happy clustering, and may your data always be resilient and available!