Cluster Available Capacity Alerts

Overview

This KB covers the relevant information for the following alerts:

Cluster physical space reached 95.0%
Cluster remaining_stripes_state changed from ABUNDANT to SCARCE

These alerts usually mean that the system's physical capacity is nearly full and must be addressed soon to avoid a halt-writes situation.

Background

The system contains DBoxes, which contain drives that are organized into RAID groups.
The system writes data at the stripe level; a stripe is a piece of the RAID structure.
When the system becomes SCARCE due to capacity, the available stripes in the system are below 15% (below 5.5%, the cluster becomes read-only: the halt-writes state).
When the number of available stripes is low, the system prioritizes background resources to release space and remain production-ready, which may also cause temporary performance degradation.

The flow responsible for releasing stripes is the Defrag process, which identifies fragmented stripe groups and rewrites them more efficiently, freeing stripes in the system.
The goal of the Defrag process is to close the gap between the available stripes in the system and the available capacity. For example, if capacity usage is 85% (15% available capacity) and the available stripes are 15%, the system is working as expected (Available stripes >= Available capacity). If instead Available stripes < Available capacity, for example capacity usage is 85% (15% available capacity) but only 12% of stripes are available, the Defrag process is probably not keeping up (this can happen for different reasons).
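
The comparison above can be sketched as a small calculation. This is only an illustration: the function name `defrag_gap` is hypothetical and not part of any VAST tool, and the two inputs are assumed to be percentages read manually from the GUI or Analytics.

```python
def defrag_gap(capacity_used_pct: float, available_stripes_pct: float) -> float:
    """Return available capacity minus available stripes, in percentage points.

    A positive value means fewer stripes are free than the free capacity
    would suggest, i.e. the Defrag process may not be keeping up.
    (Hypothetical helper for illustration only.)
    """
    available_capacity_pct = 100.0 - capacity_used_pct
    return available_capacity_pct - available_stripes_pct


# The two examples from the text: 85% used with 15% and 12% available stripes.
print(defrag_gap(85.0, 15.0))  # 0.0 -> working as expected
print(defrag_gap(85.0, 12.0))  # 3.0 -> Defrag is probably behind
```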

High physical space utilization or low available capacity can impact both performance and write availability.
When the cluster approaches these thresholds, it prioritizes maintaining a production-ready state by reallocating resources to internal processes.
As a result, customer-facing performance may gradually degrade until the cluster leaves these ranges.

Capacity thresholds

ABUNDANT - Available stripes are above 15%
SCARCE - Available stripes are below 15% and above 5.5%
HALT_WRITE - Available stripes are below 5.5%
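
As a minimal sketch, the thresholds above can be expressed as a classification function. The function and constant names are hypothetical, and the handling of values exactly equal to 15% or 5.5% is an assumption, since the thresholds above do not specify the boundary behavior.

```python
SCARCE_THRESHOLD_PCT = 15.0      # below this: SCARCE
HALT_WRITE_THRESHOLD_PCT = 5.5   # below this: HALT_WRITE


def stripes_state(available_stripes_pct: float) -> str:
    """Map an available-stripes percentage to the state names listed above.

    Assumption: a value exactly on a threshold falls into the higher state.
    """
    if available_stripes_pct < HALT_WRITE_THRESHOLD_PCT:
        return "HALT_WRITE"
    if available_stripes_pct < SCARCE_THRESHOLD_PCT:
        return "SCARCE"
    return "ABUNDANT"


print(stripes_state(10.5))  # SCARCE
```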

Monitor

To track the cluster's available stripes, use the Cluster Defragmentation graph in Analytics, which includes the available stripes percentage (the relevant Analytics metric is Available Stripes Percent). In this example, available stripes are ~10.5%:
Available Stripes Percent
Capacity usage is shown on the main dashboard in the VMS GUI. In this example, the used space is 94.6%, which means 5.4% available capacity (the relevant Analytics metric for capacity usage is Capacity,physical_space_in_use_percent).

Data reduction UI

Notes:
1) In this example, there are 10.5% available stripes and 5.4% available capacity. In this case, the Defrag process has no space left to reclaim, so consider deleting unneeded data or snapshots.
2) When unnecessary data or snapshots are deleted, the space is reclaimed by the internal delete process. (It can take time until the system handles the deletion.)
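
The reasoning in note 1 can be sketched as a simple triage helper. The function name and the returned labels are hypothetical and only summarize the guidance in this KB; they do not correspond to any VAST state or API.

```python
def capacity_triage(capacity_used_pct: float, available_stripes_pct: float) -> str:
    """Suggest a next step from the two percentages shown in the GUI.

    Hypothetical helper: the labels are illustrative, not VAST states.
    """
    available_capacity_pct = 100.0 - capacity_used_pct
    if available_stripes_pct < available_capacity_pct:
        # Fewer free stripes than free capacity: Defrag may be behind.
        return "DEFRAG_BEHIND"
    # Free stripes already match or exceed free capacity: Defrag has
    # nothing left to reclaim, so space must be freed or added.
    return "DELETE_DATA_OR_EXPAND"


# Values from the example above: 94.6% used, 10.5% available stripes.
print(capacity_triage(94.6, 10.5))  # DELETE_DATA_OR_EXPAND
```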

Troubleshooting

If unnecessary data or snapshots have been deleted but the freed space is not reflected in the GUI, contact the Customer Success Team for further investigation and troubleshooting.
If you cannot delete unnecessary data or snapshots, consider expanding the cluster with additional storage.

Capacity terms are defined in the Capacity section of the VAST Cluster 5.3 Administrator's Guide (access requires a valid VAST customer portal user).