Low Space Alerts on /vast, /vast/data/, or /userdata

Problem:

An alert indicates that one or more nodes do not have sufficient space remaining on one of the local filesystems.

The local filesystems that are commonly referenced are:

  • /vast

  • /vast/data

  • /userdata

An example of the alert typically received:

[2021-05-27 23:59:36,784: WARNING/ForkPoolWorker-106684/11277] Alarm (CRITICAL): LOGDOCKER - dnode-101 (10.100.100.100) [dnode001] - 2021-05-27T23:59:12.125913+00:00 ALERT[P9:E69:S25:F15 time="2021-05-27 23:59:12.125372629"]: trace dumper env_id=74: not enough space left on device /vast/data/traces/env (available 4280463360)

  • Other common error messages may include:

    • Not enough space left on device /vast/data/traces/env
    • /userdata partition available space dropped below 15G
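When triaging, it can help to pull the affected path and the remaining space out of the alert text. The following is a minimal sketch, assuming the alert line has the same wording as the example above; the helper names `parse_space_alert` and `bytes_to_gib` are hypothetical and are not part of any VAST tooling.

```shell
# Hypothetical helper: print "<path> <available bytes>" from an alert line
# that contains "not enough space left on device <path> (available <bytes>)".
parse_space_alert() {
  printf '%s\n' "$1" |
    sed -n 's/.*not enough space left on device \([^ ]*\) (available \([0-9]*\)).*/\1 \2/p'
}

# Hypothetical helper: convert a byte count to GiB with two decimals.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Usage:
#   parse_space_alert "$alert_line"
#   bytes_to_gib 4280463360
```

For the example alert, the reported 4280463360 bytes is a little under 4 GiB.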

We recommend running the workaround procedures in the solution section below to free up the space.

First Step:

Check whether any old Support Bundles are occupying space.

In the UI, go to Support → Support Bundle and remove any bundles listed there, unless they were created that day.

Animated GIF showing the removal of the old support bundle.

Summary:

Check to see if the space usage matches what is expected:

# Check the "/" partition; validate that at least 10% is free:
clush -a 'df -h / |grep -v Avail' |sort -h -k 5

# Check the "/vast" partition; validate that at least 10% is free:
clush -a 'df -h /vast |grep -v Avail' |sort -h -k 5

# Check the "/userdata" partition; validate that at least 15 GB is available:
clush -a 'df -h /userdata |grep -v Avail' |sort -h -k 5
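The thresholds above can also be checked mechanically. The sketch below is one possible approach and not an official tool: it reads `df -P` output for a single node on standard input and labels each filesystem OK or LOW. The function names are hypothetical, and the 15 GB threshold is treated as GiB on the assumption that it refers to the power-of-1024 units that `df -h` prints.

```shell
# Hypothetical helper: flag filesystems with less than <min> percent free.
# Expects `df -P` output (POSIX format, sizes in 1024-byte blocks) on stdin.
check_min_free_pct() {
  awk -v min="$1" 'NR > 1 {
    used = $5; sub(/%/, "", used)        # "95%" -> 95
    free = 100 - used
    status = (free >= min) ? "OK" : "LOW"
    printf "%s %d%% free %s\n", $6, free, status
  }'
}

# Hypothetical helper: flag filesystems with less than <min> GiB available.
check_min_avail_gib() {
  awk -v min="$1" 'NR > 1 {
    gib = $4 / (1024 * 1024)             # 1024-byte blocks -> GiB
    printf "%s %.1f GiB available %s\n", $6, gib, (gib >= min) ? "OK" : "LOW"
  }'
}

# Usage on a single node:
#   df -P /vast     | check_min_free_pct 10
#   df -P /userdata | check_min_avail_gib 15
```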

Solution:

To address the alarm, we suggest cleaning up old trace data that may no longer be required.

Use the clush target options to run a command on all nodes, only CNodes, only DNodes, or specific nodes:

  • -a #all

  • -g #group (cnodes / dnodes)

  • -w #nodes (IP - xx.xx.xx.xx)

To remove by a number of days (30 days) - all nodes:

clush -a 'sudo find /vast/data/metrics/ -type f -ctime +30 -delete'
clush -a 'sudo find /vast/data/traces/env/ -type f -ctime +30 -delete'

ℹ️ Info

In the Scale system, when working with a large number of CNodes or DNodes, we recommend running the delete commands in batches rather than on all nodes at once, for better performance and efficiency.
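One way to batch is to split the node list into fixed-size groups and pass each group to `clush -w` as a comma-separated list. The sketch below only builds the groups. The `batch_nodes` name and the batch size are illustrative assumptions, and any addresses you feed it are your own.

```shell
# Hypothetical helper: read node names or IPs (one per line) on stdin and
# print them as comma-separated batches of at most <size> nodes per line.
batch_nodes() {
  awk -v size="$1" '
    {
      sep = (NR % size == 1 || size == 1) ? "" : ","
      printf "%s%s", sep, $0
      if (NR % size == 0) printf "\n"
    }
    END { if (NR % size != 0) printf "\n" }'
}

# Usage (nodes.txt is a hypothetical file with one node per line):
#   batch_nodes 10 < nodes.txt | while read -r batch; do
#     clush -w "$batch" 'sudo find /vast/data/metrics/ -type f -ctime +30 -delete'
#   done
```

Each batch finishes before the next one starts, so only a limited number of nodes run the delete at any one time.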

To remove by a number of days (30 days) - a group of cnodes or dnodes:

clush -g cnodes 'sudo find /vast/data/metrics/ -type f -ctime +30 -delete'
clush -g cnodes 'sudo find /vast/data/traces/env/ -type f -ctime +30 -delete'

To remove by a number of days (30 days) - specified node/s:

clush -w 172.16.128.31 'sudo find /vast/data/metrics/ -type f -ctime +30 -delete'
clush -w 172.16.128.31 'sudo find /vast/data/traces/env/ -type f -ctime +30 -delete'

To remove files last modified at or before a specific date and time (for example, 2022-11-22 at 11:00:00):

clush -a 'sudo find /vast/data/metrics/ -type f -not -newermt "2022-11-22 11:00:00" -delete'
clush -a 'sudo find /vast/data/traces/env/ -type f -not -newermt "2022-11-22 11:00:00" -delete'
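Because `-delete` cannot be undone, it may be worth listing the matching files first. The sketch below wraps the same `find` tests used above and replaces `-delete` with `-print`. The `preview_matches` name is hypothetical. Note that `-ctime` tests the inode change time while `-newermt` tests the modification time, so the two forms can select slightly different sets of files.

```shell
# Hypothetical helper: list the regular files that a given find test would
# match, without deleting anything. Extra arguments are passed to find as-is.
preview_matches() {
  dir="$1"
  shift
  find "$dir" -type f "$@" -print
}

# Usage on a single node:
#   preview_matches /vast/data/metrics/ -ctime +30
#   preview_matches /vast/data/traces/env/ -not -newermt "2022-11-22 11:00:00"
```

If the list looks right, run the corresponding `-delete` command.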

An example workflow of clearing out space on the DNode and addressing these low-space alerts would look like this:

  • Start by checking the following to assess the amount of space:

    clush -a "df -h /vast | grep -v Mounted"
    clush -a "df -h /userdata | grep -v Mounted"
  • If the problem is in /userdata, the cause is generally old bundles or install files, for example:

    /userdata/bundles/bundle-xxxxxx
    /userdata/release-*
    /userdata/bundles/upgrades/*
  • Most often, it's the /vast partition. If that's the case, you can start with the following:

    find /vast/data/metrics/ -type f -not -newermt '2022-10-01 00:00:00' -delete

    (Note: You can adjust the date, but DNodes generally do not need metrics for further back than a week or two.)

  • Next, you can run the following commands:

    clush -a "rm /vast/data/traces/env/2020*"
    clush -a "rm /vast/data/traces/env/2021*"
    clush -a "rm /vast/data/traces/env/20220*"
    clush -a "rm /vast/data/traces/env/202210*"
    clush -a "rm /vast/data/traces/env/202211*"
    clush -a "rm /vast/data/traces/env/202212*"
    clush -a "rm /vast/data/traces/env/202301*"
  • Run the command below between each of the above commands to check how much space has been freed:

    clush -a "df -h /vast | grep -v Mounted"
  • When you have enough space, you can stop the process.
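The delete-then-check loop above can be sketched as a small script for a single node. This is an illustration under stated assumptions rather than a supported procedure: the function names are hypothetical, it runs locally rather than through clush, and the prefixes and target percentage are yours to choose.

```shell
# Hypothetical helper: print the free percentage of the filesystem that
# holds the given path, based on `df -P` output.
free_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Hypothetical helper: remove files by name prefix, in the order given,
# and stop as soon as at least <target> percent of the filesystem is free.
cleanup_until_free() {
  dir="$1"
  target="$2"
  shift 2
  for prefix in "$@"; do
    [ "$(free_pct "$dir")" -ge "$target" ] && break
    rm -f -- "$dir/$prefix"*
  done
}

# Usage (oldest prefixes first, stop at 10% free):
#   cleanup_until_free /vast/data/traces/env 10 2020 2021 20220 202210
```

Listing the oldest prefixes first keeps the most recent traces in place for as long as possible.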

EBox:

EBox is a physical enclosure that holds both compute and SSD slots, so the metrics and traces are located in different places.

To remove by a number of days (30 days) - all nodes:

clush -a 'sudo find /vast/data/{C-4200,D-4000,D-4100}/metrics/ -type f -ctime +30 -delete'
clush -a 'sudo find /vast/data/{C-4200,D-4000,D-4100}/traces/env/ -type f -ctime +30 -delete'
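The brace list in these commands expands to one path per directory name, and `find` reports an error for any path that does not exist on a given node. A minimal sketch for checking which of the directories are present on a node before cleaning up is shown below. The `ebox_dirs` name is hypothetical, and the three directory names are simply those used in the commands above.

```shell
# Hypothetical helper: print the per-node-type subdirectories that exist
# under <base>, for example <base>/C-4200/metrics.
ebox_dirs() {
  base="$1"
  sub="$2"
  for node_dir in C-4200 D-4000 D-4100; do
    [ -d "$base/$node_dir/$sub" ] && printf '%s\n' "$base/$node_dir/$sub"
  done
  return 0
}

# Usage on a single node:
#   ebox_dirs /vast/data metrics
#   ebox_dirs /vast/data traces/env
```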

To remove by a number of days (30 days) - a group of cnodes or dnodes:

# Only Cnodes
clush -a 'sudo find /vast/data/C-4200/metrics/ -type f -ctime +30 -delete'
clush -a 'sudo find /vast/data/C-4200/traces/env/ -type f -ctime +30 -delete'

# Only Dnodes
clush -a 'sudo find /vast/data/{D-4000,D-4100}/metrics/ -type f -ctime +30 -delete'
clush -a 'sudo find /vast/data/{D-4000,D-4100}/traces/env/ -type f -ctime +30 -delete'

To remove by a number of days (30 days) - specified node/s:

clush -w 172.16.128.31 'sudo find /vast/data/C-4200/metrics/ -type f -ctime +30 -delete'
clush -w 172.16.128.31 'sudo find /vast/data/C-4200/traces/env/ -type f -ctime +30 -delete'

Support:

If the above commands do not resolve the issue and one of the local filesystems is still low on space:

  • Please engage VAST Customer Support if you need to delete any trace or metric data that is less than 30 days old.

  • Please also check with support before deleting any other files or folders in the /vast or /userdata directories.