Looking to build a high availability Ceph cluster with ease? Ansible Playbooks have your back! Whether you're scaling out storage for a home lab or enterprise setup, automating your Ceph deployment is key to reliability and efficiency. In this guide, I'll walk you through step-by-step how to set up a resilient, high availability Ceph cluster using Ansible Playbooks—so you can focus on your data, not the details. Let's get your cluster up and running like a pro!
- This gist covers basic Ansible playbooks that simplify the repetitive tasks involved in provisioning your Ceph clusters.
- We will be using MicroCeph to build a High Availability (HA) Ceph cluster.
- A bare minimum of 3 Linux-based nodes that support snapd. These can be virtual machines or bare-metal systems that live within a network boundary where they can reach each other.
For example: I am doing this on 3 Orange Pi 3Bs that I have at home. Each has a self-compiled Linux image with kernel support for Ceph and RBD. Please refer to my Medium article for more information.
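If you are rolling your own image, a quick sanity check for Ceph/RBD kernel support looks like the sketch below. It assumes a Debian-style kernel config under /boot (as on Armbian/Ubuntu); adjust the path for your distro.
# Check that the kernel was built with CephFS and RBD support
grep -E 'CONFIG_CEPH_FS|CONFIG_CEPH_LIB|CONFIG_BLK_DEV_RBD' /boot/config-$(uname -r)
# Try loading the rbd module and confirm it is present
sudo modprobe rbd && lsmod | grep rbd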
I do want to point out that these are OPi 3B 8G V1 boards, which are still being sold on Amazon even though they have reached end of sale and been taken off the shelf. Please don't buy them before doing your research. Depending on the base Linux kernel version you compile your image with (for Ceph/RBD support, if you wish to use Armbian), there are kernel and hardware changes that will break either u-boot or the Wi-Fi and Ethernet.
The Orange Pi 3B has issues with both Joshua Riek's Ubuntu images and OPi's official images, where for some reason there is a constant load on the cores even at idle. Losing roughly 25 percent of your compute power to an idle system didn't seem very appealing, so I went the Armbian route, which doesn't have that issue.
Please do not use SBCs for production deployments. Feel free to use an HA Proxmox cluster or a Metal3-based topology on more robust server-grade systems for production deployments.
In this section we are going to apply the following security hardening to our bare-metal nodes.
Refer to this Red Hat documentation on securing SSH for more details. Just create an sshd_config file, copy it over to each node at /etc/ssh/, and then restart the SSH service: sudo systemctl restart ssh. A copy-and-restart sketch follows the sample config below.
- Update the SSH config to allow only a single user, with password login disabled and authorized-key logins only.
- Change the default port for the SSH server to some arbitrary value.
- Don't allow root login.
- Use Protocol 2.
- Set the maximum concurrent sessions to a low value like 2 or 3.
Sample sshd_config. Please make necessary adjustments (like for Port, AllowUsers <ssh-user-name>, etc.) before restarting the ssh service.
# This is the sshd server system-wide configuration file. See
# sshd_config(5) for more information.
# This sshd was compiled with PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
# The strategy used for options in the default sshd_config shipped with
# OpenSSH is to specify options with their default value where
# possible, but leave them commented. Uncommented options override the
# default value.
Include /etc/ssh/sshd_config.d/*.conf
Port 2322
#AddressFamily any
#ListenAddress 0.0.0.0
#ListenAddress ::
#HostKey /etc/ssh/ssh_host_rsa_key
#HostKey /etc/ssh/ssh_host_ecdsa_key
#HostKey /etc/ssh/ssh_host_ed25519_key
# Ciphers and keying
#RekeyLimit default none
# Logging
#SyslogFacility AUTH
#LogLevel INFO
# Authentication:
#LoginGraceTime 2m
PermitRootLogin no
#StrictModes yes
#MaxAuthTries 6
#MaxSessions 10
PubkeyAuthentication yes
# Expect .ssh/authorized_keys2 to be disregarded by default in future.
#AuthorizedKeysFile .ssh/authorized_keys .ssh/authorized_keys2
#AuthorizedPrincipalsFile none
#AuthorizedKeysCommand none
#AuthorizedKeysCommandUser nobody
# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts
#HostbasedAuthentication no
# Change to yes if you don't trust ~/.ssh/known_hosts for
# HostbasedAuthentication
#IgnoreUserKnownHosts no
# Don't read the user's ~/.rhosts and ~/.shosts files
#IgnoreRhosts yes
# To disable tunneled clear text passwords, change to no here!
PasswordAuthentication no
PermitEmptyPasswords no
# Change to yes to enable challenge-response passwords (beware issues with
# some PAM modules and threads)
KbdInteractiveAuthentication no
# Kerberos options
#KerberosAuthentication no
#KerberosOrLocalPasswd yes
#KerberosTicketCleanup yes
#KerberosGetAFSToken no
# GSSAPI options
#GSSAPIAuthentication no
#GSSAPICleanupCredentials yes
#GSSAPIStrictAcceptorCheck yes
#GSSAPIKeyExchange no
# Set this to 'yes' to enable PAM authentication, account processing,
# and session processing. If this is enabled, PAM authentication will
# be allowed through the KbdInteractiveAuthentication and
# PasswordAuthentication. Depending on your PAM configuration,
# PAM authentication via KbdInteractiveAuthentication may bypass
# the setting of "PermitRootLogin yes".
# If you just want the PAM account and session checks to run without
# PAM authentication, then enable this but set PasswordAuthentication
# and KbdInteractiveAuthentication to 'no'.
UsePAM yes
#AllowAgentForwarding yes
#AllowTcpForwarding yes
#GatewayPorts no
X11Forwarding no
MaxSessions 3
#X11DisplayOffset 10
#X11UseLocalhost yes
#PermitTTY yes
PrintMotd no
#PrintLastLog yes
#TCPKeepAlive yes
#PermitUserEnvironment no
#Compression delayed
ClientAliveInterval 60
ClientAliveCountMax 3
#UseDNS no
#PidFile /run/sshd.pid
#MaxStartups 10:30:100
#PermitTunnel no
#ChrootDirectory none
#VersionAddendum none
# no default banner path
#Banner none
# Allow client to pass locale environment variables
AcceptEnv LANG LC_*
# override default of no subsystems
Subsystem sftp /usr/lib/openssh/sftp-server
# Example of overriding settings on a per-user basis
#Match User anoncvs
# X11Forwarding no
# AllowTcpForwarding no
# PermitTTY no
# ForceCommand cvs server
AllowUsers <ssh-user-name>
Protocol 2
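Once you've adjusted the sample above, a minimal sketch for pushing it out to every node and restarting SSH could look like the following. It assumes the file is saved locally as sshd_config, the nodes still listen on the default port 22, and your user can already sudo; the IPs are the ones used in the inventory later in this guide.
# Hypothetical loop - replace the IPs and user with your own
for node in 192.168.9.78 192.168.9.82 192.168.9.81; do
  scp sshd_config <ssh-user-name>@$node:/tmp/sshd_config
  ssh <ssh-user-name>@$node "sudo install -m 644 /tmp/sshd_config /etc/ssh/sshd_config && sudo systemctl restart ssh"
done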
Recommended: add your user <ssh-user-name> to sudoers to allow passwordless sudo. Otherwise you will need to put credentials in the Ansible inventory, which is not very secure.
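A minimal sketch for that, run once on each node (replace the <ssh-user-name> placeholder with your actual user):
# Grants passwordless sudo to your Ansible SSH user
echo '<ssh-user-name> ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/<ssh-user-name>
sudo chmod 0440 /etc/sudoers.d/<ssh-user-name>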
With all that out of the way, we can dive right into the Ansible stuff. The following can be run from any machine that can reach your intended nodes over the network.
We need Python and Ansible installed before we can get started.
# Let's create a conda env with Python and pip in it
conda create --name ansible_ceph_hq python pip
# Activate the conda env
conda activate ansible_ceph_hq
# Install Ansible
pip install ansible
# Let's create a folder
mkdir deploy-ceph && cd deploy-ceph
# Let's create an Inventory file
touch inventory.ini
# Create a directory for playbooks
mkdir playbooks
The contents of your inventory are simple. Refer to my example below and update it with your user:
[ceph_nodes]
hc-opi3b8-1 ansible_host=192.168.9.78 ansible_port=2322
hc-opi3b8-2 ansible_host=192.168.9.82 ansible_port=2322
hc-opi3b8-3 ansible_host=192.168.9.81 ansible_port=2322
[all:vars]
ansible_connection=ssh
ansible_user=<ssh-user-name>
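Before touching any playbooks, it's worth a quick ad-hoc ping from the deploy-ceph directory to confirm Ansible can reach every node over SSH with your key:
# Each node should come back with "ping": "pong"
ansible -i inventory.ini ceph_nodes -m ping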
Let's create a file called microceph_ha_preinstall_checks.yaml inside the playbooks directory. It will install the snapd daemon if it is not already installed, and check for raw drives and their read speeds.
---
- name: Microceph HA setup preinstall checks
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Check if snapd is installed
      package:
        name: snapd
        state: present
      register: snapd_check
    - name: Display message if snapd was installed
      debug:
        msg: "snapd was not installed, now installing."
      when: snapd_check.changed
    - name: Confirm snapd installation
      debug:
        msg: "snapd is already installed."
      when: not snapd_check.changed
    # Without this, on some older Debian kernels you would not have the full microceph-support plugin.
    # We can assume that if this playbook is running on an OS where the snap daemon had to be installed,
    # then we also upgrade it.
    - name: Upgrade snapd to latest version
      command: sudo snap install snapd
      when: snapd_check.changed
    - name: Gather all block devices
      command: lsblk -d -n -o NAME
      register: block_devices
    - name: Set the drive paths
      set_fact:
        drive_paths: "{{ block_devices.stdout_lines | map('regex_replace', '^', '/dev/') | list }}"
    - name: Run hdparm read speed test on specific drives
      command: /sbin/hdparm -t {{ item }}
      register: hdparm_result
      loop: "{{ drive_paths }}"
      # Filter as needed for your devices
      when: item is match('/dev/sd[a-z]') or item is match('/dev/nvme.*')
    - name: Display the hdparm output
      debug:
        var: hdparm_result.results
Run the above from inside the playbooks directory with the following command:
ansible-playbook -i ../inventory.ini microceph_ha_preinstall_checks.yaml
This should produce results like:




As you can see from the snapshots above, snapd has been checked/installed on the nodes, and then you can see the disk read speeds for all your filtered drives.
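If a single node misbehaves, you can rerun the checks against just that host with more verbose output; --limit and -v are standard ansible-playbook flags, and the host name here is just one from the example inventory:
ansible-playbook -i ../inventory.ini microceph_ha_preinstall_checks.yaml --limit hc-opi3b8-2 -v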
Let's create a file called microceph_ha_setup_cluster.yaml
- This will bootstrap one node of our choosing and then form an HA cluster using the rest.
Please update the values of your hosts as needed.
---
- name: Install and hold refresh for microceph
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Ensure snapd is installed
      package:
        name: snapd
        state: present
    - name: Install microceph using snap
      command: sudo snap install microceph
      register: install_microceph
      changed_when: "'microceph' in install_microceph.stdout"
    - name: Hold snap refresh for microceph
      command: sudo snap refresh --hold microceph
      when: install_microceph is changed
      register: hold_microceph
    - name: Display install status
      debug:
        msg: "Microceph installed and refresh held."
      when: install_microceph.changed
- name: Microceph cluster setup on master node
  hosts: hc-opi3b8-1
  become: yes
  tasks:
    - name: Bootstrap the microceph cluster on the first node
      command: sudo microceph cluster bootstrap
    - name: Add second node to the cluster
      command: sudo microceph cluster add hc-opi3b8-2
      register: add_node_2_result
    - name: Add third node to the cluster
      command: sudo microceph cluster add hc-opi3b8-3
      register: add_node_3_result
- name: Join microceph cluster on node-2
  hosts: hc-opi3b8-2
  become: yes
  tasks:
    - name: Join the microceph cluster using the token from master node
      command: sudo microceph cluster join {{ hostvars['hc-opi3b8-1'].add_node_2_result.stdout }}
      when: hostvars['hc-opi3b8-1'].add_node_2_result is defined
- name: Join microceph cluster on node-3
  hosts: hc-opi3b8-3
  become: yes
  tasks:
    - name: Join the microceph cluster using the token from master node
      command: sudo microceph cluster join {{ hostvars['hc-opi3b8-1'].add_node_3_result.stdout }}
      when: hostvars['hc-opi3b8-1'].add_node_3_result is defined
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_setup_cluster.yaml
You will see results like this:


Let's create a file called microceph_ha_add_storage.yaml
- This will add OSDs/storage devices from each node.
Please update the drive names as necessary.
---
- name: Add OSD storage to microceph cluster
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Add disks to microceph as OSDs with wipe option
      # Update as necessary
      command: sudo microceph disk add /dev/sda /dev/nvme0n1 --wipe
    # OPTIONAL - if your primary OS has enough space
    - name: Add loop disks to microceph as OSDs
      # Update as necessary
      command: sudo microceph disk add loop,75G,2
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_add_storage.yaml
You will see results like this:

At this point you should have an HA Ceph cluster with storage added.
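If you'd like a quick manual sanity check before writing the status playbook below, you can SSH into any node and run the following (the ceph command is assumed to be available the same way the check playbook below uses it):
# Cluster membership and services as seen by microceph
sudo microceph status
# Overall Ceph health, OSDs, and capacity
sudo ceph status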
Let's create a file called microceph_ha_check_cluster.yaml
- This will show you the status of the Ceph cluster you just created.
This can be run on any node, so feel free to update the hosts var.
---
- name: Check microceph cluster status and disk list
  # This can be run on any host that has microceph installed
  hosts: hc-opi3b8-1
  become: yes
  tasks:
    - name: Check microceph cluster status
      command: sudo ceph status
      register: ceph_status
      changed_when: false
    - name: Display microceph cluster status
      debug:
        var: ceph_status.stdout
    - name: List microceph OSD disks
      command: sudo microceph disk list
      register: disk_list
      changed_when: false
    - name: Display microceph OSD disk list
      debug:
        var: disk_list.stdout
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_check_cluster.yaml
You will see results like this:

Behold my 9 TB data lake, all set up by running 4 Ansible playbooks. Add as many nodes as you need to your inventory and scale out to hundreds of nodes if you'd like.
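Scaling out is mostly a matter of adding hosts to the inventory and wiring up a matching cluster add/join pair in the setup playbook. A hypothetical fourth node (the host name and IP below are made up) would look like this in inventory.ini:
[ceph_nodes]
hc-opi3b8-1 ansible_host=192.168.9.78 ansible_port=2322
hc-opi3b8-2 ansible_host=192.168.9.82 ansible_port=2322
hc-opi3b8-3 ansible_host=192.168.9.81 ansible_port=2322
# Hypothetical new node - replace with your own host name and IP
hc-opi3b8-4 ansible_host=192.168.9.84 ansible_port=2322
Then add a microceph cluster add / microceph cluster join pair for the new host in microceph_ha_setup_cluster.yaml and rerun the storage playbook against it.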
Let's create a file called microceph_ha_destroy_cluster.yaml
- This will tear down your entire cluster.
---
- name: Destroy microceph HA cluster
  hosts: ceph_nodes
  become: yes
  tasks:
    - name: Remove microceph with purge
      command: sudo snap remove microceph --purge
      register: remove_microceph
      changed_when: "'microceph removed' in remove_microceph.stdout or 'microceph removed' in remove_microceph.stderr"
    - name: Confirm microceph removal
      debug:
        msg: "Microceph has been successfully removed from {{ inventory_hostname }}."
      when: remove_microceph is changed
Now run this playbook:
ansible-playbook -i ../inventory.ini microceph_ha_destroy_cluster.yaml
You will see results like this:
And there you have it—a streamlined, high availability Ceph cluster deployed and managed with the power of Ansible! By automating the deployment process, you've not only saved time but also ensured consistency and scalability in your storage setup. As your storage needs grow, your Ansible playbooks will make expanding and managing your cluster a breeze. Happy clustering, and may your data always be resilient and available!