I was recently consulted about a Ceph Cluster running into nearfull and backfillfull for the first time. One Ceph OSD was utilized over 85% and another over 90%. The operators were unsure what this meant and what to do about it, so I took a look.
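
To see which OSDs had crossed which threshold, the per-OSD utilization and the configured ratios can be listed like this (the defaults are 0.85 for nearfull and 0.90 for backfillfull):

$ ceph osd df                  # the %USE column shows per-OSD utilization
$ ceph osd dump | grep ratio   # full_ratio, backfillfull_ratio, nearfull_ratio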

Looking at ceph status and ceph df, I noticed something. Try to spot it yourself – I made it easier by trimming some of the surrounding output:

$ ceph status
[...]
    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            1 backfillfull osd(s)
            1 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
            1 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time
            20 pool(s) backfillfull

  services:
[...]
    osd: 96 osds: 96 up (since 4w), 96 in (since 12M); 1 remapped pgs
[...]
  data:
    volumes: 4/4 healthy
    pools:   20 pools, 4769 pgs
    objects: 31.90M objects, 117 TiB
    usage:   351 TiB used, 522 TiB / 873 TiB avail
    pgs:     299136/95689743 objects misplaced (0.313%)
             4763 active+clean
             5    active+clean+scrubbing+deep
             1    active+remapped+backfill_toofull
[...]
# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    873 TiB  522 TiB  351 TiB   351 TiB      40.19
TOTAL  873 TiB  522 TiB  351 TiB   351 TiB      40.19

--- POOLS ---
POOL                           ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics           1  4096  907 MiB      108  2.7 GiB      0     14 TiB
rbd                             4   145   97 TiB   25.39M  290 TiB  87.43     14 TiB
[...]

The raw usage was only at 40%. Why would one disk contain so much data? The balancer was in upmap mode and active. But even with no balancer at all, an imbalance this extreme would be very unlikely.
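
For reference, the balancer mode and state can be checked like this:

$ ceph balancer status   # reported "active": true and "mode": "upmap" here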

You may have already spotted something odd in the Ceph Pool configuration. While device_health_metrics contained less than 1GiB, it had 4096 PGs. At the same time rbd contained 97TiB in just 145 PGs.

145 is not only not a power of two (which would usually produce a Ceph Warning), but also far too low for the amount of data and the number of Ceph OSDs.
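
For completeness, the PG count of a pool can also be queried directly, independent of ceph df:

$ ceph osd pool get rbd pg_num                     # pg_num: 145 at this point
$ ceph osd pool get device_health_metrics pg_num   # pg_num: 4096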

What Does This Mean for Storage Distribution?

Estimating the size of one PG for pool rbd yields about 685GiB (97TiB / 145). How many (average-sized) PGs would it take to push one disk over 85% utilization?

About 11.4 (85% * 9TiB / 685GiB, with 9TiB being roughly the capacity of a single OSD: 873TiB / 96).
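
The same back-of-the-envelope math, spelled out with bc:

$ echo "scale=0; (97 * 1024) / 145" | bc         # GiB per PG
685
$ echo "scale=1; (0.85 * 9 * 1024) / 685" | bc   # PGs to fill 85% of a ~9TiB OSD
11.4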

Unfortunately, not every PG is the same size. Looking at the sizes, multiple PGs exceeded 800GiB. Furthermore, not every Ceph OSD receives the same number of PGs. And as we will soon see, the number of PGs was about to shrink even further.
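
Both of these effects are easy to check: per-PG sizes and per-OSD PG counts are visible with

$ ceph pg ls-by-pool rbd   # the BYTES column shows the size of each PG
$ ceph osd df              # the PGS column shows how many PGs each OSD holds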

Cause and Distributing Data

But what actually caused the bogus PG numbers? The answer: the PG Autoscaler. For some reason, only device_health_metrics had a target_size_ratio set: 0.1. This led to an effective ratio of 1 for this pool. Apparently the autoscaler assumed this meant all data would be stored in this pool. This also explained why the number of PGs was not a power of two: the autoscaler had set target_pg_num to 32 to shrink the pool rbd even further, which is why there was no Ceph Warning. It also means that even if no disk had been running full right now, it certainly would have happened in the following days.
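
The offending setting can be confirmed per pool (assuming ceph osd pool get supports target_size_ratio on that release); the autoscaler's view is in the TARGET RATIO and EFFECTIVE RATIO columns of the autoscale status:

$ ceph osd pool get device_health_metrics target_size_ratio
$ ceph osd pool autoscale-status   # compare TARGET RATIO and EFFECTIVE RATIO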

Before removing the ratio, I wanted to know what would happen. So I disabled autoscaling (ceph osd pool set noautoscale) and then removed the target ratio:

# ceph osd pool set device_health_metrics target_size_ratio 0
# ceph osd pool autoscale-status
POOL                             SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
device_health_metrics          906.5M                3.0        873.1T  0.0000                                  1.0    4096           1  on         False
rbd                            99048G                3.0        873.1T  0.3323                                  1.0      32        1024  on         False
[...]

This looked a lot better, so we decided to re-enable autoscaling (ceph osd pool unset noautoscale) right away.

After a few minutes the backfillfull warning was gone, soon followed by the nearfull warning. After a few days of rebalancing, both pools had the proper PG count.
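
While the cluster rebalanced, progress could be followed with something like:

$ ceph health detail        # names the nearfull/backfillfull OSDs, if any remain
$ watch -n 30 ceph status   # the misplaced object count should keep dropping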

I am not sure why the ratio was set, nor why it was interpreted the way it was. The docs suggest this should not be a problem, and I could not reproduce the behavior on a more recent version of Ceph, so this may have been fixed already.