How To Safely Power Down a VAST Cluster


Summary

Thanks to its non-volatile memory, a VAST cluster can tolerate a sudden, unexpected power outage, but users sometimes ask how to power down the cluster cleanly for a planned outage.

Prerequisites

  • Get a list of all outstanding alarms, so the cluster's health can be compared before and after the outage.

Power Down

The steps are relatively straightforward:

  1. Note the CNode on which VMS is currently running and record its IP address. An easy way to find it:

    clush -g cnodes docker ps | grep vms

  2. Deactivate the cluster, either by selecting the cluster on the Infrastructure -> Clusters page and clicking Pause/Deactivate, or by using VCLI (cluster deactivate) to do the same.

     

    Deactivate the VAST cluster

  3. Wait until the entire cluster deactivates cleanly on its own. You can verify in the Activities log that the deactivation activity's state is Completed.

    Verify cluster deactivation

  4. Set the cluster into maintenance mode to prevent the leader from attempting to access the DNodes.

    Steps: (For CNode OR EBox)

    Set disable leader election on all CNodes:

    clush -g cnodes 'touch /vast/data/DONT_RUN_LEADER'

    Set disable leader election on all EBoxes:

    clush -g cnodes 'touch /vast/data/C-4200/DONT_RUN_LEADER'

    Verify the file is set on all CNodes:

    clush -Bg cnodes 'ls /vast/data/DONT_RUN_LEADER'

    Verify the file is set on all EBoxes:

    clush -Bg cnodes 'ls /vast/data/C-4200/DONT_RUN_LEADER'

    Kill the leader:

    vtool suicide

    Wait about 1 minute, then verify that no new leader is running:

    find-leader

  5. Log on to any CNode and run a clush command to shut the nodes down at the Linux level. Stop the CNodes before the DNodes. Avoid an immediate shutdown: the system begins shutting down at once, which causes the clush command itself to fail. Instead, schedule the shutdowns a few minutes in advance so the CNodes stop first. If you have another preferred method for powering down systems, that is fine too. The example below stops the CNodes in 2 minutes and the DNodes in 5:

    Verify clush is healthy

    clush -L -a hostname

    Shut down the cluster's CNodes and DNodes

    clush -g cnodes 'sudo /usr/sbin/shutdown -h +2'; clush -g dnodes 'sudo /usr/sbin/shutdown -h +5'

    For EBox clusters, run only the following command

    clush -g cnodes 'sudo /usr/sbin/shutdown -h +2'

ℹ️ Info

Caution: make sure your terminal session is healthy (the initial clush -L -a hostname confirms this). In rare cases a terminal fails to execute commands properly through clush; if that happens, the shutdown will not run and the clush command may hang. Simply log out, log back in, and retry.

  6. Confirm everything is down. For example, ping all nodes on the management network to confirm they are offline.

  7. Physically unplug the power now, if appropriate. This largely depends on what you plan to do next as part of the power-down.
    Note that DBoxes continue to consume power even after the operating system is off (Need to check on Ceres!)
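The final "confirm everything is down" check can be scripted. Below is a minimal sketch, assuming a POSIX shell; `probe_host` and `all_nodes_down` are hypothetical helpers, and the probe is an ordinary ping against management IPs you supply:

```shell
#!/bin/sh
# Minimal sketch: report any node that still answers on the
# management network before power is physically removed.
# probe_host is a placeholder reachability check (here, one ping).
probe_host() { ping -c 1 -W 1 "$1" >/dev/null 2>&1; }

# Returns 0 only if no node in the argument list answered.
all_nodes_down() {
  up=0
  for host in "$@"; do
    if probe_host "$host"; then
      echo "still up: $host"
      up=1
    fi
  done
  return "$up"
}

# Example call (the IPs are illustrative, as in the power-up section):
#   all_nodes_down 10.27.117.102 10.27.117.104
```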

Power Up

The power-up is similar; the steps are mostly reversed. For example:

  1. Ensure the cluster switches are fully powered up, and all ports (MLAG and ISL) are online. Log in to your switch via ssh and run:

    master> en
    master# conf t
    master(conf) # show interface status
    ......
    master(conf) # show mlag

    Check this on the standby switch as well.

  2. Plug in the CNodes and wait for them to power up.

    If, for some reason, a CNode fails to power up, run sudo ipmitool -H <IPMI> -U admin -P <admin password> -I lanplus power on

    1. Wait 5 minutes to ensure the CNodes start cleanly.

    2. Example for 4 CNodes:

      date;for i in 10.27.117.10{2,4,6,8}; do echo Server:$i; sudo ipmitool -H $i -U admin -P <admin password> -I lanplus power on;done

ℹ️ Info

The DNode power-up step below is not relevant for EBox clusters

  3. Plug in the DNodes and wait for them to power up.
    If, for some reason, a DNode fails to power up, run sudo ipmitool -H <IPMI> -U admin -P <admin password> -I lanplus power on

    1. If the DBox is B2B-Ipmi, it must be powered up manually.

    2. Wait 5 minutes to ensure the DNodes start cleanly.

  4. Enable the leader on the cluster; at this point, the leader should start bringing up VMS automatically.
    Do not activate the cluster yet from the CLI/GUI.

    Steps: (For CNode OR EBox)

    Remove disable leader election file from all CNodes:

    clush -g cnodes 'rm -f /vast/data/DONT_RUN_LEADER'

    Remove disable leader election file from all EBoxes:

    clush -g cnodes 'rm -f /vast/data/C-4200/DONT_RUN_LEADER'

    Verify the file is removed from all CNodes (ls should now report "No such file or directory"):

    clush -Bg cnodes 'ls /vast/data/DONT_RUN_LEADER'

    Verify the file is removed from all EBoxes (ls should now report "No such file or directory"):

    clush -Bg cnodes 'ls /vast/data/C-4200/DONT_RUN_LEADER'

    Make sure that the leader is running:

    find-leader

Wait about 2 minutes for the leader to fully recover.
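Rather than a fixed wait, the recovery check can be polled. A small sketch, assuming a POSIX shell; `wait_until` is a generic, hypothetical helper, while `find-leader` is the cluster tool used in the steps above:

```shell
#!/bin/sh
# Sketch: run a check command repeatedly until it succeeds,
# or give up after a fixed number of attempts.
# usage: wait_until TRIES DELAY CMD [ARGS...]
wait_until() {
  tries=$1; delay=$2; shift 2
  n=0
  while [ "$n" -lt "$tries" ]; do
    if "$@"; then return 0; fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "gave up after $tries attempts" >&2
  return 1
}

# In practice: poll for up to 2 minutes, every 10 seconds:
#   wait_until 12 10 find-leader
```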

  5. VMS should start automatically. If not, contact VAST customer support.

    1. Check with the find-vms command and try to reach the GUI in a browser.

  6. Go into the VMS web UI or use VCLI (cluster activate) to activate the cluster.

     

    Activate the VAST cluster

  7. Wait until all nodes are activated. This will take some time.

    1. Once all nodes are activated and the cluster is available, check from a client that data services work as expected.
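The client-side check in the last step can be as simple as a write/read round trip on a cluster export. A sketch, assuming a POSIX shell; `smoke_test` is a hypothetical helper, and the mount path is whatever NFS/SMB export the client normally uses:

```shell
#!/bin/sh
# Sketch: write a file on a cluster export, read it back, clean up.
# Returns 0 only if the data round-trips intact.
smoke_test() {
  dir=$1
  f="$dir/.vast_smoke_$$"          # temp file name; $$ avoids collisions
  echo "hello" > "$f" || return 1  # write through the mount
  read -r got < "$f"               # read it back
  rm -f "$f"
  [ "$got" = "hello" ]
}

# Example (the mount path is an assumption -- use your own):
#   smoke_test /mnt/vast
```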

Limitations

  • VAST supports data retention on a powered-down system for no more than 90 days.

    • Why: The flash memory in SSDs holds data as a stored charge. That charge eventually leaks from the flash cells, causing data degradation. As a result, the JEDEC (Joint Electron Device Engineering Council) standard for enterprise SSDs requires that vendors warrant their SSDs to reliably hold data for 90 days when powered off.

    • See Risk of Data Retention on Solid State Devices Stored Without Power