Deploying Ceph Reef cluster using cephadm

The last time I had a chance to work with Ceph was around the Nautilus release (14), several years ago.

Since then, quite a few things have changed in how a Ceph cluster is created and managed.

In this article, I plan to refresh my knowledge of deploying Ceph, using the Reef release (18) as an example.

First, let’s decide on the system requirements. Ceph sizing could easily fill more than one article on its own, but briefly:

CPU:

Monitor – the Monitor service is not a heavy consumer of CPU resources. A minimum of 2 cores is recommended.

It is important to keep in mind that other services can, and most likely will, run on the same host – for example, the Manager – and they will also need some CPU resources;

OSD – 1 core per BlueStore OSD service; under intense load you may need 2 cores. One OSD service usually corresponds to one data disk, so a server with 12 drives will most likely run 12 OSD services.

RAM:

Monitor – the more, the better. In a medium-sized cluster, 64GB per server is enough; in large clusters with more than 300 OSDs – 128GB;

OSD – from 4 to 8GB of RAM per BlueStore OSD service, where 4GB is an acceptable minimum.
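As a side note, this per-OSD memory figure corresponds to the osd_memory_target option, which BlueStore treats as a target for its caches. Once the cluster built later in this article is up, it can be checked and adjusted; a minimal sketch, where the ~6GB value is just an illustrative assumption:

[root@ceph-mon-01 ~]# ceph config get osd osd_memory_target
[root@ceph-mon-01 ~]# ceph config set osd osd_memory_target 6442450944   # ~6GB per OSD daemon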

Drives:

The rules here are general.

Typically, one disk is one OSD service. It is possible to run several OSD services on one disk, but this is strongly discouraged for performance reasons.

It is recommended to use disks with a capacity of at least 1TB, taking into account the cost-per-gigabyte ratio.

The number of disks per server and their size must be selected carefully because the failure domain should be taken into account. A server with a large number of 8TB disks may cost less, but the failure of such a server will be much more unpleasant than the failure of a server with fewer disks and less capacity.

It is advisable to install several SSDs on the server and place WAL+DB on them to increase performance.

Network:

The recommendation here is simple – at least 10Gbps; more is better. A fast network means better data access, replication, and recovery speeds. Keep in mind, though, that a server with 12 HDDs is unlikely to need a 25Gbps link – the disks simply cannot saturate it.

It is advisable to consider organizing a separate network for client access and a network for cluster tasks.

Cluster configuration:

The first thing to decide on is the number of Monitor services. The basic rule is that more than 50% of them must be available to form a quorum, so an even total number does not make sense.

For example, take two clusters – one with 5 monitors and one with 6. If two are lost, the first is left with 3 monitors (>50%) and the second with 4 (>50%). With the loss of a third monitor, the first cluster is down to 2 out of 5 (<50% of the original), while the second is down to 3 out of 6 (exactly 50%).

Since more than 50% of the monitors must be available to form a quorum, neither the first nor the second cluster keeps working – yet the second one consumed more resources.

In general, the number of monitors depends on the cluster size. In a production environment I would go with 5, and for small clusters 3. In my practice, I have periodically run into situations where, while one monitor was down, another one briefly disappeared as well. In that situation, a cluster with 5 monitor services remains viable, while a cluster with 3 does not.

Now about OSD.

It is worth deciding on the number of OSDs per server. There is no clear-cut answer here, but always remember that the larger the server, the longer recovery will take if it fails, and the longer the risk persists that something else fails in the meantime.

Thus, we gradually arrive at the question of which Replication Factor to use. The Replication Factor is the number of copies of our data in the cluster. Depending on the settings, Ceph controls where the additional copies are placed: they can be distributed between servers or, for example, between racks in a data center.

RF-1 – one copy of data. If any disk or server in the cluster fails, the data is irretrievably lost;

RF-2 – two copies of data. If a server fails, the second copy is still available on another server: the data is safe and accessible to users, and the process of re-creating the missing copy on the remaining servers begins;

RF-3 – three copies of data in three different places.

What to choose? We will not consider RF-1 at all, because we value our data.

When using RF-2, RF-3, or more, you need to understand that 100GB of data will occupy 200GB in the cluster in the first case and 300GB in the second (yes, Erasure Coding can be used instead, but that is a separate topic).

Many unprepared people are shocked by the amount of space consumed by the second and third copies: “We bought three servers of 50TB each, why can we only load 50TB of data onto them?”

Then they say – let’s use RF-2, it’s a shame to waste the space. Then one server fails, two more disks on other servers die during the replication process, and the data is lost.

It is difficult to say which replication factor to use without concrete requirements and information about the value of the hosted data, but in most cases I would go with RF-3, i.e. storing data in triplicate, since it significantly increases the chances of surviving cascading failures.
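In Ceph, the replication factor is configured per pool via the size parameter, while min_size defines how many copies must be available for the pool to keep serving I/O. A minimal sketch – the pool name test-pool is just an illustration, and the commands assume the cluster built later in this article is already up:

[root@ceph-mon-01 ~]# ceph osd pool create test-pool
[root@ceph-mon-01 ~]# ceph osd pool set test-pool size 3
[root@ceph-mon-01 ~]# ceph osd pool set test-pool min_size 2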

We talked a little about what resources are needed to run a Ceph cluster, what number of services to use, and also a little about the Replication Factor.

It’s time to get started with the installation.

My test setup:

Six virtual machines running Rocky Linux 9.3 take part in the test. I plan to use 3 of them for the Monitor and Manager services, and the other 3 for OSDs. Each OSD machine has 3 additional disks attached.

DNS records have been created for all hosts, and all machines can reach each other by both short name and FQDN. The hostname on each machine is set to the short name, not the FQDN (this is important).
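A quick sketch of how this can be verified on each node (the host names below are just my lab’s):

[root@ceph-mon-01 ~]# hostnamectl set-hostname ceph-mon-01    # short name, not FQDN
[root@ceph-mon-01 ~]# hostname -s
[root@ceph-mon-01 ~]# getent hosts ceph-mon-02                # both short name and FQDN should resolve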

The first step is to install Python and Docker (or Podman, if preferred) on all servers. Ceph services now run in containers – one container per service.

Since Rocky Linux is derived from Red Hat, Podman is the preferable choice, and it is also available in the standard repositories.

We perform this operation on all machines that we plan to use in Ceph:

[root@ceph-node1/2/3/4/5/6 ~]# dnf install python3
[root@ceph-node1/2/3/4/5/6 ~]# python3 --version
Python 3.9.18

[root@ceph-node1/2/3/4/5/6 ~]# dnf install podman
[root@ceph-node1/2/3/4/5/6 ~]# podman -v
podman version 4.6.1
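cephadm will also check that time synchronization is in place (you will see this in the bootstrap output below). On Rocky Linux 9 chrony is usually installed already, but just in case, a sketch of making sure it runs on every node:

[root@ceph-node1/2/3/4/5/6 ~]# dnf install chrony
[root@ceph-node1/2/3/4/5/6 ~]# systemctl enable --now chronyd
[root@ceph-node1/2/3/4/5/6 ~]# chronyc tracking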

All further operations are performed from the first machine. In my case, it is ceph-mon-01.

First of all, let’s install the cephadm utility, which will be used to create and configure the cluster.

Specify Ceph Release version and download cephadm:

[root@ceph-mon-01 ~]# CEPH_RELEASE=18.2.1
[root@ceph-mon-01 ~]# curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
[root@ceph-mon-01 ~]# chmod +x cephadm

Add Reef repositories:

[root@ceph-mon-01 ~]# ./cephadm add-repo --release reef
Writing repo to /etc/yum.repos.d/ceph.repo...
Enabling EPEL...
Completed adding repo.

Install cephadm:

[root@ceph-mon-01 ~]# ./cephadm install
Installing packages ['cephadm']...

Note that here you specify the Ceph release you plan to use – in my case the latest one, Reef.

Let’s start creating the cluster by adding the first Ceph Monitor service using cephadm bootstrap.

With this command we install the Monitor, Manager, Dashboard, and other services, configure the firewall rules, and create a basic (almost empty) ceph.conf configuration file.

I perform the procedure from the first node:

[root@ceph-mon-01 ~]# cephadm bootstrap --mon-ip 10.10.11.13
Creating directory /etc/ceph for ceph.conf
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman (/usr/bin/podman) version 4.6.1 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
...

Here we specify the IP address of the node where the first Monitor service should run.

Note that in a production environment, it is good practice to use a separate network for inter-OSD replication traffic.

You can specify this network when creating the cluster using the --cluster-network option. For example:

[root@ceph-mon-01 ~]# cephadm bootstrap --mon-ip 10.10.11.13 --cluster-network 192.168.1.0/24

After completing this operation, a “basic cluster” of one monitor will be created.

You will immediately be provided with the address and credentials to access Ceph Dashboard:

Ceph Dashboard is now available at:
            URL: https://ceph-mon-01:8443/
            User: admin
        Password: 2sd2nk9asdsad
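The Dashboard will ask you to change this generated password at first login. If you prefer to set it from the command line instead, something like the sketch below works once the ceph CLI is available (ceph-common is installed a bit later in this article); the password and file path are, of course, just placeholders:

[root@ceph-mon-01 ~]# echo 'MyNewStrongPassword' > /root/dashboard_pass.txt
[root@ceph-mon-01 ~]# ceph dashboard ac-user-set-password admin -i /root/dashboard_pass.txt
[root@ceph-mon-01 ~]# rm /root/dashboard_pass.txt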

At this point, a number of containers are already running on the node:

[root@ceph-mon-01 ~]# podman ps
CONTAINER ID  IMAGE                                                                                      COMMAND               CREATED             STATUS             PORTS       NAMES
91bd31bf4d3e  quay.io/ceph/ceph:v18                                                                      -n mon.ceph-mon-0...  2 minutes ago       Up 2 minutes                   ceph-24c20e62-c4da-11ee-ba95-005056aad62a-mon-ceph-mon-01
9b61113d8793  quay.io/ceph/ceph:v18                                                                      -n mgr.ceph-mon-0...  About a minute ago  Up About a minute              ceph-24c20e62-c4da-11ee-ba95-005056aad62a-mgr-ceph-mon-01-czyfjm
9a05839af09b  quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f  -n client.ceph-ex...  56 seconds ago      Up 56 seconds                  ceph-24c20e62-c4da-11ee-ba95-005056aad62a-ceph-exporter-ceph-mon-01
4e0cf5637318  quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f  -n client.crash.c...  54 seconds ago      Up 54 seconds                  ceph-24c20e62-c4da-11ee-ba95-005056aad62a-crash-ceph-mon-01
e0aef0f63ded  quay.io/prometheus/node-exporter:v1.5.0                                                    --no-collector.ti...  47 seconds ago      Up 48 seconds                  ceph-24c20e62-c4da-11ee-ba95-005056aad62a-node-exporter-ceph-mon-01
568217a31e2d  quay.io/prometheus/alertmanager:v0.25.0                                                    --cluster.listen-...  36 seconds ago      Up 36 seconds                  ceph-24c20e62-c4da-11ee-ba95-005056aad62a-alertmanager-ceph-mon-01
ac144fe2cb23  quay.io/ceph/ceph-grafana:9.4.7                                                            /bin/bash             14 seconds ago      Up 14 seconds                  ceph-24c20e62-c4da-11ee-ba95-005056aad62a-grafana-ceph-mon-01
9e5dac2859e5  quay.io/prometheus/prometheus:v2.43.0                                                      --config.file=/et...  1 second ago        Up 1 second                    ceph-24c20e62-c4da-11ee-ba95-005056aad62a-prometheus-ceph-mon-01

To view the status of the cluster and continue configuring it, install the additional ceph-common package:

[root@ceph-mon-01 ~]# cephadm install ceph-common
[root@ceph-mon-01 ~]# ceph -s
  cluster:
    id:     24c20e62-c4da-11ee-ba95-005056aad62a
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum ceph-mon-01 (age 6m)
    mgr: ceph-mon-01.czyfjm(active, since 4m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

Based on the cluster status, we see the presence of one monitor, one mgr, and zero OSDs.
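As a side note, installing ceph-common is optional for quick checks – cephadm can run the same commands inside a temporary container:

[root@ceph-mon-01 ~]# cephadm shell -- ceph -s
[root@ceph-mon-01 ~]# cephadm shell -- ceph health detail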

Now let’s add two more monitors to our cluster, bringing their total number to three.

Since the configuration is performed from the first node, we will set up passwordless SSH access from it to the two other monitor nodes by copying the cluster’s SSH public key from /etc/ceph:

[root@ceph-mon-01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-mon-02
[root@ceph-mon-01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-mon-03

Add two hosts to the cluster:

[root@ceph-mon-01 ~]# ceph orch host add ceph-mon-02
Added host 'ceph-mon-02' with addr '10.10.11.14'

[root@ceph-mon-01 ~]# ceph orch host add ceph-mon-03
Added host 'ceph-mon-03' with addr '10.10.11.15'

Let’s look at the list of nodes in the cluster:

[root@ceph-mon-01 ~]# ceph orch host ls
HOST         ADDR         LABELS  STATUS
ceph-mon-01  10.10.11.13  _admin
ceph-mon-02  10.10.11.14
ceph-mon-03  10.10.11.15
3 hosts in cluster
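Note the _admin label on the first host: cephadm keeps a copy of the admin keyring and ceph.conf on hosts carrying this label, which is what lets the ceph CLI work there. If you want another node to serve as an admin host as well, the label can be added explicitly (shown purely as an illustration):

[root@ceph-mon-01 ~]# ceph orch host label add ceph-mon-02 _admin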

On the added nodes we can run various Ceph services.

Let’s start the Monitor service on the newly added nodes:

[root@ceph-mon-01 ~]# ceph orch daemon add mon ceph-mon-02
Deployed mon.ceph-mon-02 on host 'ceph-mon-02'

[root@ceph-mon-01 ~]# ceph orch daemon add mon ceph-mon-03
Deployed mon.ceph-mon-03 on host 'ceph-mon-03'

You may see an error:

Error EINVAL: name mon.ceph-mon-02 already in use
Error EINVAL: name mon.ceph-mon-03 already in use

A little more about this later, but if you ran into this error, you can simply move on to the next step. It means that the Monitor daemons have already been deployed on those nodes automatically.
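To confirm which Monitor daemons are actually running and where, you can ask the orchestrator:

[root@ceph-mon-01 ~]# ceph orch ls mon
[root@ceph-mon-01 ~]# ceph orch ps --daemon_type mon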

Let’s check the cluster status and make sure that the total number of monitors is three:

[root@ceph-mon-01 ~]# ceph -s
  cluster:
    id:     24c20e62-c4da-11ee-ba95-005056aad62a
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 4m)
    mgr: ceph-mon-01.czyfjm(active, since 14m), standbys: ceph-mon-02.iyewzj
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

You can also see the automatically deployed Mgr services. They work in active/standby mode – only one of them is active at any given time.

Now let’s add OSD servers responsible for storing data.

Similarly, we copy the SSH key to the OSD nodes:

[root@ceph-mon-01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-osd-01
[root@ceph-mon-01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-osd-02
[root@ceph-mon-01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-osd-03

First of all, we add nodes to the cluster:

[root@ceph-mon-01 ~]# ceph orch host add ceph-osd-01
Added host 'ceph-osd-01' with addr '10.10.11.16'

[root@ceph-mon-01 ~]# ceph orch host add ceph-osd-02
Added host 'ceph-osd-02' with addr '10.10.11.17'

[root@ceph-mon-01 ~]# ceph orch host add ceph-osd-03
Added host 'ceph-osd-03' with addr '10.10.11.18'

If you look at the list of hosts now, it will, as expected, become larger:

[root@ceph-mon-01 ~]# ceph orch host ls
HOST         ADDR         LABELS  STATUS
ceph-mon-01  10.10.11.13  _admin
ceph-mon-02  10.10.11.14
ceph-mon-03  10.10.11.15
ceph-osd-01  10.10.11.16
ceph-osd-02  10.10.11.17
ceph-osd-03  10.10.11.18
6 hosts in cluster
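Before creating OSDs, it is worth checking which devices cephadm considers available on the new hosts (a device must be empty – no partitions, file systems, or LVM state – to be usable):

[root@ceph-mon-01 ~]# ceph orch device ls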

Then we launch the OSD services on these nodes. After the node hostname, we specify the disks that will be used to store data:

[root@ceph-mon-01 ~]# ceph orch daemon add osd ceph-osd-01:/dev/sdb,/dev/sdc,/dev/sdd
Created osd(s) 0,1,2 on host 'ceph-osd-01'

[root@ceph-mon-01 ~]# ceph orch daemon add osd ceph-osd-02:/dev/sdb,/dev/sdc,/dev/sdd
Created osd(s) 3,4,5 on host 'ceph-osd-02'

[root@ceph-mon-01 ~]# ceph orch daemon add osd ceph-osd-03:/dev/sdb,/dev/sdc,/dev/sdd
Created osd(s) 6,7,8 on host 'ceph-osd-03'

This is the simplest option for my lab environment. In a production environment where it is recommended to move DB and WAL to an SSD, the following command may be useful:

ceph orch daemon add osd host:data_devices=/dev/sdb,/dev/sdc,db_devices=/dev/sdd,osds_per_device=2

Where sdb and sdc are hard drives, and sdd is a solid-state drive.
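For more than a couple of hosts, it is usually more convenient to describe OSDs declaratively with an OSD service specification and let cephadm apply it to every matching host. A sketch – the service_id and host pattern are just examples matching my lab naming, and the rotational flags are used to tell HDDs (data) from SSDs (DB/WAL):

[root@ceph-mon-01 ~]# cat > osd_spec.yaml <<EOF
service_type: osd
service_id: hdd_with_ssd_db
placement:
  host_pattern: 'ceph-osd-*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF
[root@ceph-mon-01 ~]# ceph orch apply -i osd_spec.yaml --dry-run
[root@ceph-mon-01 ~]# ceph orch apply -i osd_spec.yaml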

Let’s check the status of the added OSDs:

[root@ceph-mon-01 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.26367  root default
-3         0.08789      host ceph-osd-01
 0    hdd  0.02930          osd.0             up   1.00000  1.00000
 1    hdd  0.02930          osd.1             up   1.00000  1.00000
 2    hdd  0.02930          osd.2             up   1.00000  1.00000
-5         0.08789      host ceph-osd-02
 3    hdd  0.02930          osd.3             up   1.00000  1.00000
 4    hdd  0.02930          osd.4             up   1.00000  1.00000
 5    hdd  0.02930          osd.5             up   1.00000  1.00000
-7         0.08789      host ceph-osd-03
 6    hdd  0.02930          osd.6             up   1.00000  1.00000
 7    hdd  0.02930          osd.7             up   1.00000  1.00000
 8    hdd  0.02930          osd.8             up   1.00000  1.00000

Three hosts, each running three Ceph OSD services.

WEIGHT determines the priority of data placement on the disk. Disks with a higher value will have higher priority for storing the data. When adding an OSD, the WEIGHT value is set automatically and corresponds to the size of the disk. A 1TB disk will have WEIGHT = 1.

The host weight is the sum of the weights of the OSDs on that host. Nothing stops you from adjusting the weights manually later (though you rarely need to).
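If you ever do need to change a weight – for example, to gradually drain data off a disk – the CRUSH weight can be adjusted per OSD (the value here is purely illustrative):

[root@ceph-mon-01 ~]# ceph osd crush reweight osd.0 0.01465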

Now the cluster status displays the available space for storing data, as well as the number and status of the OSD:

[root@ceph-mon-01 ~]# ceph -s
  cluster:
    id:     24c20e62-c4da-11ee-ba95-005056aad62a
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03,ceph-osd-01,ceph-osd-02 (age 17m)
    mgr: ceph-mon-01.czyfjm(active, since 35m), standbys: ceph-mon-02.iyewzj
    osd: 9 osds: 9 up (since 71s), 9 in (since 113s)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   641 MiB used, 269 GiB / 270 GiB avail
    pgs:     1 active+clean

An attentive reader will ask – why 5 monitors, and not 3 as was originally specified? Good question. Let’s fix it:

[root@ceph-mon-01 ~]# ceph orch apply mon --placement="3 ceph-mon-01 ceph-mon-02 ceph-mon-03"
Scheduled mon update...

[root@ceph-mon-01 ~]# ceph orch apply mgr --placement="3 ceph-mon-01 ceph-mon-02 ceph-mon-03"
Scheduled mgr update...

Using the commands above, we specified the required number of Mon and Mgr services and also indicated the hosts where these services should be located.
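You can verify the resulting placement with the orchestrator’s service listing, which shows each service, the number of running versus expected daemons, and the placement rule:

[root@ceph-mon-01 ~]# ceph orch ls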

Here I want to mention how convenient it is to run services in containers: in the current version of Ceph, you can easily control which services run and on which nodes.

The answer to the question of why there were 5 monitors is simple – that is the default count for cephadm. This is also why you may have seen the error earlier: cephadm automatically deployed additional Monitors on the available nodes to comply with that policy.

Let’s check the services:

services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 14s)
    mgr: ceph-mon-01.czyfjm(active, since 36m), standbys: ceph-mon-02.iyewzj, ceph-mon-03.xcpobs
    osd: 9 osds: 9 up (since 2m), 9 in (since 3m)

Now everything is as it should be – the number of monitors is three. True, after this operation I ran into a small bug that produced a warning:

1 stray daemon(s) not managed by cephadm

The solution is simple: restart the active ceph-mgr service. In my case it was running on the first node:

[root@ceph-mon-01 ~]# ceph orch daemon restart mgr.ceph-mon-01.czyfjm
Scheduled to restart mgr.ceph-mon-01.czyfjm on host 'ceph-mon-01'

You may notice that after the restart the active Mgr moves to another node.

That’s all, actually. Running a small Ceph cluster is not a difficult task.

It remains to decide on the protocols through which storage resources will be provided:

Block access using RBD (Rados Block Device);
File-based using CephFS;
Object (S3) using RadosGW.

But more on that another time.
