How To Safely Power Down a VAST Cluster


Summary

Thanks to its non-volatile memory, a VAST cluster can tolerate a sudden, unexpected power outage, but users sometimes ask how to power down the cluster cleanly for a planned outage.

Prerequisites

  • Get a list of all outstanding alarms, so the cluster's health can be compared before and after the outage.

Power Down

The steps are relatively straightforward:

  1. Note the CNode on which VMS is currently running and record its IP address. An easy way to find it:

    clush -g cnodes docker ps | grep vms

  2. Deactivate the cluster, either by selecting the cluster on the Infrastructure -> Clusters page and clicking Pause/Deactivate, or by using VCLI (cluster deactivate) to do the same.

     

    Deactivate the VAST cluster

  3. Wait until the entire cluster deactivates cleanly on its own. You can verify in the Activities log that the deactivation activity's state is Completed.

    Verify cluster deactivation

  4. Set the cluster into maintenance mode to prevent the leader from attempting to access the DNodes.

    Steps: (For CNode OR EBox)

    Set disable leader election on all CNodes:

    clush -g cnodes 'touch /vast/data/DONT_RUN_LEADER'

    Set disable leader election on all EBoxes:

    clush -g cnodes 'touch /vast/data/C-4200/DONT_RUN_LEADER'

    Verify the file is set on all CNodes:

    clush -Bg cnodes 'ls /vast/data/DONT_RUN_LEADER'

    Verify the file is set on all EBoxes:

    clush -Bg cnodes 'ls /vast/data/C-4200/DONT_RUN_LEADER'

    Kill the leader:

    vtool suicide

    Wait about 1 minute, then verify that no new leader is running:

    find-leader

  5. Log on to any CNode and run a clush command to shut the nodes down at the Linux level. Stop the CNodes before the DNodes. Avoid an immediate shutdown: the system begins shutting down at once, which causes the clush command itself to fail. Instead, schedule the shutdowns a few minutes in advance so the CNodes stop first. If you have another preferred method for powering down systems, that is fine too. The example below stops the CNodes in 2 minutes and the DNodes in 5:

    Verify clush is healthy

    clush -L -a hostname

    Shut down the cluster's CNodes and DNodes

    clush -g cnodes 'sudo /usr/sbin/shutdown -h +2'; clush -g dnodes 'sudo /usr/sbin/shutdown -h +5'

    For EBox clusters, run only the following command

    clush -g cnodes 'sudo /usr/sbin/shutdown -h +2'

ℹ️ Info

Caution: make sure your terminal session is healthy (the initial clush -L -a hostname confirms this). In rare cases a terminal fails to execute commands properly through clush; if that happens, the shutdown will not run and the clush command may hang. Simply log out, log back in, and retry.

  6. Confirm everything is down. For example, ping all nodes on the management network to confirm they are offline.

  7. Physically unplug the power now, if appropriate. This largely depends on what you plan to do next as part of the power-down.
    Note that DBoxes continue to consume power even after the operating system is off (Need to check on Ceres!)
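The final "confirm everything is down" check can be scripted. Below is a minimal sketch, assuming a POSIX shell; `probe_host` and `all_nodes_down` are hypothetical helpers, and the probe is an ordinary ping against management IPs you supply:

```shell
#!/bin/sh
# Minimal sketch: report any node that still answers on the
# management network before power is physically removed.
# probe_host is a placeholder reachability check (here, one ping).
probe_host() { ping -c 1 -W 1 "$1" >/dev/null 2>&1; }

# Returns 0 only if no node in the argument list answered.
all_nodes_down() {
  up=0
  for host in "$@"; do
    if probe_host "$host"; then
      echo "still up: $host"
      up=1
    fi
  done
  return "$up"
}

# Example call (the IPs are illustrative, as in the power-up section):
#   all_nodes_down 10.27.117.102 10.27.117.104
```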

Power Up

The power-up is similar; the steps are mostly reversed. For example:

  1. Ensure the cluster switches are fully powered up, and all ports (MLAG and ISL) are online. Log in to your switch via ssh and run:

    master> en
    master# conf t
    master(conf) # show interface status
    ......
    master(conf) # show mlag

    Check this on the standby switch as well.

  2. Plug in the CNodes and wait for them to power up.

    If, for some reason, a CNode fails to power up, run sudo ipmitool -H <IPMI> -U admin -P <admin password> -I lanplus power on

    1. Wait 5 minutes to ensure the CNodes start cleanly.

    2. Example for 4 CNodes:

      date;for i in 10.27.117.10{2,4,6,8}; do echo Server:$i; sudo ipmitool -H $i -U admin -P <admin password> -I lanplus power on;done

ℹ️ Info

The DNode power-up step below is not relevant for EBox clusters

  3. Plug in the DNodes and wait for them to power up.
    If, for some reason, a DNode fails to power up, run sudo ipmitool -H <IPMI> -U admin -P <admin password> -I lanplus power on

    1. If the DBox is B2B-Ipmi, it must be powered up manually.

    2. Wait 5 minutes to ensure the DNodes start cleanly.

  4. Enable the leader on the cluster; at this point, the leader should start bringing up VMS automatically.
    Do not activate the cluster yet from the CLI/GUI.

    Steps: (For CNode OR EBox)

    Remove disable leader election file from all CNodes:

    clush -g cnodes 'rm -f /vast/data/DONT_RUN_LEADER'

    Remove disable leader election file from all EBoxes:

    clush -g cnodes 'rm -f /vast/data/C-4200/DONT_RUN_LEADER'

    Verify the file is removed from all CNodes (ls should now report "No such file or directory"):

    clush -Bg cnodes 'ls /vast/data/DONT_RUN_LEADER'

    Verify the file is removed from all EBoxes (ls should now report "No such file or directory"):

    clush -Bg cnodes 'ls /vast/data/C-4200/DONT_RUN_LEADER'

    Make sure that the leader is running:

    find-leader

Wait about 2 minutes for the leader to fully recover.
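Rather than a fixed wait, the recovery check can be polled. A small sketch, assuming a POSIX shell; `wait_until` is a generic, hypothetical helper, while `find-leader` is the cluster tool used in the steps above:

```shell
#!/bin/sh
# Sketch: run a check command repeatedly until it succeeds,
# or give up after a fixed number of attempts.
# usage: wait_until TRIES DELAY CMD [ARGS...]
wait_until() {
  tries=$1; delay=$2; shift 2
  n=0
  while [ "$n" -lt "$tries" ]; do
    if "$@"; then return 0; fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "gave up after $tries attempts" >&2
  return 1
}

# In practice: poll for up to 2 minutes, every 10 seconds:
#   wait_until 12 10 find-leader
```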

  5. VMS should start automatically. If not, contact VAST customer support.

    1. Check with the find-vms command and try to reach the GUI in a browser.

  6. Go into the VMS web UI or use VCLI (cluster activate) to activate the cluster.

     

    Activate the VAST cluster

  7. Wait until all nodes are activated. This will take some time.

    1. Once all nodes are activated and the cluster is available, check from a client that data services work as expected.
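The client-side check in the last step can be as simple as a write/read round trip on a cluster export. A sketch, assuming a POSIX shell; `smoke_test` is a hypothetical helper, and the mount path is whatever NFS/SMB export the client normally uses:

```shell
#!/bin/sh
# Sketch: write a file on a cluster export, read it back, clean up.
# Returns 0 only if the data round-trips intact.
smoke_test() {
  dir=$1
  f="$dir/.vast_smoke_$$"          # temp file name; $$ avoids collisions
  echo "hello" > "$f" || return 1  # write through the mount
  read -r got < "$f"               # read it back
  rm -f "$f"
  [ "$got" = "hello" ]
}

# Example (the mount path is an assumption -- use your own):
#   smoke_test /mnt/vast
```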

Limitations

  • VAST supports data retention on a powered-down system for no more than 90 days.

    • Why: The flash memory in SSDs holds data as a stored charge. That charge eventually leaks from the flash cells, causing data degradation. As a result, the JEDEC (Joint Electron Device Engineering Council) standard for enterprise SSDs requires that vendors warrant their SSDs to reliably hold data for 90 days when powered off.

    • See Risk of Data Retention on Solid State Devices Stored Without Power