Design Considerations
The VAST AI OS memory configuration strives to provide the best performance, scale, and resiliency possible. These are conflicting requirements:
Scale - Provide the largest possible memory for the core OS to support the largest possible number of clients, active protocol requests, and the many configuration objects in the system.
Performance - Serve as many configuration and metadata objects as possible from node memory to provide the lowest response times, while keeping the CNodes stateless.
Resiliency - The cluster has two services that each run on a single CNode at a time:
Cluster Leader - one of the CNodes is automatically elected to run this service.
All CNodes must run with enough free memory to be elected as the leader at any given moment.
VMS (VAST Management Server) - one of the CNodes is automatically elected to run the management service (this configuration can be customized with the dedicated VMS configuration).
All CNodes must run with enough free memory to be elected to run VMS at any given moment.
To conclude, the memory configuration must accommodate a state in which a single CNode runs the core OS, the cluster leader, and VMS at the same time.
This explanation refers to CNodes but is relevant to EBox nodes as well in the case of an EBox cluster. For an EBox cluster, each CNode is also a DNode. The core OS memory consumption covers both CNode and DNode services.
CNode Memory Configuration
The CNode memory consists of the following:
VAST Core OS | The core OS, including the data ingest (protocols support), erasure coding, cluster services, DB, replication, global namespace, DataEngine, etc. This will consume:
|
Cluster Leader | A single CNode will also run the leader service. The leader consumes ~10-15GB. |
VMS | A single CNode will also run the VMS service, which includes the management service itself, surrounding web services, and the PostgreSQL service. The VMS consumes ~10GB. |
Linux OS | The Linux OS - the kernel and the surrounding Linux services. |
Vendor-specific, other | Some server-specific HW monitoring services. |
Free | This reserved memory acts as a buffer for dynamic runtime allocations, protecting the CNode from out-of-memory (OOM) errors that would trigger a panic and system reboot. |
The following diagram visualizes how a 256GB CNode memory will be allocated:
CNode memory allocation
A few notes:
VAST Platform and Leader memory is pre-allocated in full at node startup, which takes a few minutes. This explains why node memory consumption typically goes up during cluster power-on or after node failovers.
VMS memory is allocated and released dynamically as ongoing operations (monitoring tasks, etc.) are executed. These operations consume memory themselves and also trigger PostgreSQL queries, which require additional memory.
The design goal is to leave ~10-15% (~25-30GB in 256GB CNodes, 35-55GB in 384GB CNodes) free for the Linux OS and vendor-specific or any other custom services running in the server.
Any additional third-party services installed on the server may consume memory that is outside of the standard calculations, which is why it is important to coordinate any such requirement with VAST Data.
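To check a node against the ~10-15% free-memory design goal described in the notes above, the comparison can be sketched as follows. This is a minimal sketch, not a VAST tool: the function names are hypothetical, and it assumes that MemAvailable in /proc/meminfo is a reasonable stand-in for "free" memory, which the notes do not specify.

```shell
# Sketch only: function names are hypothetical, and using MemAvailable as the
# "free" figure is an assumption, not something the article specifies.

# mem_available_pct [MEMINFO_FILE]
# Prints MemAvailable as an integer percentage of MemTotal.
mem_available_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total == 0) exit 1
            printf "%d\n", (avail * 100) / total
        }
    ' "$meminfo"
}

# check_free_goal PCT [GOAL]
# Succeeds when PCT is at or above GOAL (default 10, the low end of the
# ~10-15% design goal).
check_free_goal() {
    local pct="$1" goal="${2:-10}"
    [ "$pct" -ge "$goal" ]
}
```

On a CNode, running mem_available_pct with no argument reads /proc/meminfo, and passing its output to check_free_goal returns a non-zero status when the node is below the goal.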
Troubleshooting Memory Consumption
There can be occasions in which memory consumption becomes higher than expected. In such cases, a VMS alarm may be triggered:
CNode cnode-128-1 (172.16.128.1) [Rack-CB1-U-bottom] memory usage reached to 98.0%
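When alarms like this are forwarded to scripts or tickets, the CNode name, IP address, and usage percentage can be pulled out of the text. The following is a minimal sketch that assumes every memory alarm follows the layout of the single example above; the function name is hypothetical and the pattern may need adjusting for other alarm variants.

```shell
# Sketch only: the pattern is derived from the one example alarm above and
# is an assumption about the general alarm format.

# parse_memory_alarm "ALARM TEXT"
# Prints: <cnode-name> <ip> <usage-percent>
# Prints nothing if the text does not match the expected layout.
parse_memory_alarm() {
    printf '%s\n' "$1" | sed -n -E \
        's/^CNode ([^ ]+) \(([^)]+)\).* memory usage reached to ([0-9.]+)%$/\1 \2 \3/p'
}
```

For the example alarm above, parse_memory_alarm prints cnode-128-1 172.16.128.1 98.0.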
Information to collect and review
When such issues occur, capture the following information:
Collect two debug bundles within 24 hours.
If it is not possible to collect a full bundle, collect at least the outputs of:
atop - press m, p
atop -m 1 1 > atop_memory_report.txt
atop -pm 1 1 > atop_grouped_memory.txt
System-wide memory information:
date > memory_report.txt
cat /proc/meminfo >> memory_report.txt
vmstat -s >> memory_report.txt
sudo slabtop -o | head -n 20 >> memory_report.txt
Process-specific memory information:
Collect the following information for the relevant processes - mostly the top memory-consuming processes (the core VAST OS processes, VMS, etc.)
pmap -x [PID] >> memory_report.txt
Leader process:
ssh `find-leader`
ps -aux | grep aaaa-bbbbccccdddd | awk '{print $2}' | head -1
Docker statistics
docker stats --no-stream >> memory_report.txt
Check for previous process kills
dmesg -T | grep -E -i "killed|oom|out of memory" >> memory_report.txt
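The manual collection steps above can be wrapped in a small script so that nothing is missed under pressure. This is a minimal sketch, not a VAST-provided tool: the function names are hypothetical, the commands are the ones listed above, and any command that is not installed is noted in the report and skipped. The atop reports and the leader PID lookup are left out and would still be collected separately.

```shell
# Sketch only: run_section and collect_memory_report are hypothetical helpers.

# run_section REPORT TITLE CMD [ARGS...]
# Appends a titled section with the command output to REPORT.
run_section() {
    local report="$1" title="$2"
    shift 2
    {
        printf '\n=== %s ===\n' "$title"
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || printf '(command exited with status %d)\n' "$?"
        else
            printf '(skipped: %s not found)\n' "$1"
        fi
    } >> "$report"
}

# collect_memory_report [REPORT [PID...]]
# Writes the system-wide outputs, then pmap -x for each PID given.
collect_memory_report() {
    local report="${1:-memory_report.txt}" pid
    [ "$#" -gt 0 ] && shift
    date > "$report"
    run_section "$report" "/proc/meminfo" cat /proc/meminfo
    run_section "$report" "vmstat -s" vmstat -s
    run_section "$report" "slabtop (first 20 lines)" \
        sh -c 'sudo slabtop -o | head -n 20'
    for pid in "$@"; do
        run_section "$report" "pmap -x $pid" pmap -x "$pid"
    done
    run_section "$report" "docker stats" docker stats --no-stream
    # grep exits with status 1 when nothing matches; the report then shows
    # that status, which here simply means no kill messages were found.
    run_section "$report" "previous process kills" \
        sh -c 'dmesg -T | grep -E -i "killed|oom|out of memory"'
}
```

Calling collect_memory_report with a report path followed by the PIDs of the top memory-consuming processes produces one file with a labeled section per command.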
Automated analysis
Run luna analyze and luna analyze vms_memory.