Measure and Monitor CPU Utilization on VAST Data

Customers and CSPs need a clear view of the cluster's actual utilization to determine whether the current performance headroom is sufficient, whether more load can be added, and where bottlenecks or imbalances may exist across CNodes.
VAST exposes per-CNode metrics to help customers directly compare node utilization, track trends, and trigger alerts when utilization remains high.

Two key metrics can be used to track CPU utilization per CNode:

  • window_runtime_pct_ingest_only_silos - front-end/protocol ingest work

  • window_runtime_pct_normal_silos - background/backend work

Together, they indicate both user-driven and system activity per CNode. In a healthy, balanced cluster, these values are similar across nodes and typically stay at or below about 50%. Set alerts if either remains above 80% for a sustained period.

Metric Definition and Usage

Each CNode runs multiple silos, which are execution contexts used by the VAST scheduler to separate different types of tasks. The following metrics can be used to measure and compare per-CNode utilization:

  • window_runtime_pct_ingest_only_silos - CPU utilization (%) for ingest-only silos that handle protocol and client I/O requests such as NFS, S3, and SMB.

  • window_runtime_pct_normal_silos - CPU utilization (%) for normal silos performing internal or background operations such as metadata management, replication, or backend writes.
When both metrics remain roughly balanced across CNodes, the cluster is considered healthy and evenly utilized. It is also common for the normal-silo value to be somewhat higher than the ingest-silo value.

Interpreting CPU Utilization

A balanced, properly utilized cluster typically shows both metrics at up to about 50% utilization.
The window_runtime_pct_normal_silos metric is expected to be slightly higher on write-heavy workloads, since writes drive additional background operations such as persistence, replication, and metadata updates.

Short bursts above 80% are normal during workload spikes or maintenance activity.
Sustained utilization above 80% for 30 minutes or longer indicates that the cluster is consistently at high load. The 30-minute guideline provides sufficient sampling time to differentiate between short-term bursts and actual capacity saturation, enabling meaningful alerting without false positives.
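
This guideline can be scripted against the ad_hoc_query endpoint shown under REST API Example later in this article. The following is a minimal sketch, not a supported tool: it assumes the endpoint accepts a 30m time_frame (the example below uses 2h), that jq is available on the host, and the output wording is illustrative. It prints a line for every CNode whose ingest or normal silo utilization exceeded 80% in every sample of the window.

curl -sku admin:******* \
"https://<vms_address>/api/latest/monitors/ad_hoc_query?object_type=cnode&time_frame=30m\
&granularity=minutes\
&prop_list=SchedulerMetrics,window_runtime_pct_normal_silos__avg\
&prop_list=SchedulerMetrics,window_runtime_pct_ingest_only_silos__avg" \
| jq -r '.data
         | group_by(.[1])                            # one group of samples per CNode id
         | .[]
         | select(all(.[]; .[2] > 80 or .[3] > 80))  # every sample in the window above 80%
         | "cnode \(.[0][1]): above 80% for the whole 30-minute window"'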

Imbalances between CNodes (for example, one node consistently higher than others) may indicate skewed distribution, network bottlenecks, or workload imbalance. For more details on how VAST automatically distributes protocol sessions and rebalances load across CNodes, see Best Practice on Load Balancing CNodes.

Common observations, likely causes, and recommended actions:

  • All CNodes high - the cluster is fully utilized or undersized. Consider scaling cluster CNodes/DNodes/EBoxes.

  • One CNode consistently higher - workload imbalance or an affinity issue. Check protocol client routing and VIP assignment (see the sketch after this list).

  • High ingest silo only - heavy front-end traffic. Validate protocol throughput and network utilization.

  • High normal silo only - background processes (replication, rebuild). Monitor backend load; utilization may normalize automatically once background jobs complete or self-throttle, returning to baseline without intervention.
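
To spot a CNode that runs consistently hotter than its peers, compare the per-CNode averages over a window. The following is a minimal sketch: it reuses the ad_hoc_query call from the REST API Example below and assumes jq is available; the output format is illustrative.

curl -sku admin:******* \
"https://<vms_address>/api/latest/monitors/ad_hoc_query?object_type=cnode&time_frame=2h\
&granularity=minutes\
&prop_list=SchedulerMetrics,window_runtime_pct_normal_silos__avg\
&prop_list=SchedulerMetrics,window_runtime_pct_ingest_only_silos__avg" \
| jq -r '.data
         | group_by(.[1])    # one group of samples per CNode id
         | .[]
         | "cnode \(.[0][1]): normal_avg=\(map(.[2]) | add / length) ingest_avg=\(map(.[3]) | add / length)"'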

ℹ️ Info

Note: The VMS Web UI report described below is available on versions 5.3.3+ and 5.0.0-sp73. The metrics above may be available on other versions, but not as a VMS Web UI report.

Viewing CPU Utilization in the Web UI

  1. Go to Analytics → Analytics from the left navigation menu.

  2. In the right pane, select Predefined Analytics.

  3. Search for CPU Window Utilization.

  4. Use Aggregation Type = Avg and Timeframe = Last 2 hours (or adjust as needed).

[Screenshot: CPU Window Utilization analytics for two CNodes (cnode-128-5 and cnode126-6) over a 2-hour timeframe, aggregated as Avg, showing utilization percentages and latencies for the general and ingest silo windows.]

ℹ️ Info

Note: If the predefined CPU Window Utilization view in your version shows each CNode in a separate window, use Customized Analytics to compare multiple CNodes on a single chart. Create a custom report with either SchedulerMetrics,window_runtime_pct_ingest_only_silos__avg or SchedulerMetrics,window_runtime_pct_normal_silos__avg:

  1. Analytics → Analytics → Customized Analytics → (+)

  2. Name: e.g., CPU utilization Ingest • Object Type: CNode

  3. In All Properties, search window_runtime_pct_*_silos__avg → add the one you want

[Screenshot: Create Analytics dialog for a custom report named "CPU utilization Ingest", with the ingest-silo window runtime percentage (average) selected, over a 10-minute timeframe at seconds resolution.]

Accessing Metrics via API or Prometheus

REST API Example

curl -sku admin:******* \
"https://<vms_address>/api/latest/monitors/ad_hoc_query?object_type=cnode&time_frame=2h\
&granularity=minutes\
&prop_list=SchedulerMetrics,window_runtime_pct_normal_silos__avg\
&prop_list=SchedulerMetrics,window_runtime_pct_ingest_only_silos__avg"

Example JSON Output (sample non-zero values):

{
  "object_ids": [3, 1, 2],
  "prop_list": [
    "timestamp",
    "object_id",
    "SchedulerMetrics,window_runtime_pct_normal_silos__avg",
    "SchedulerMetrics,window_runtime_pct_ingest_only_silos__avg"
  ],
  "data": [
    ["2025-10-12T15:40:00Z", 3, 57.8, 22.4],
    ["2025-10-12T15:40:00Z", 1, 54.1, 25.0],
    ["2025-10-12T15:40:00Z", 2, 52.6, 23.9],
    ["2025-10-12T15:35:00Z", 3, 61.2, 28.3],
    ["2025-10-12T15:35:00Z", 1, 58.7, 26.1],
    ["2025-10-12T15:35:00Z", 2, 59.5, 27.4],
    ["2025-10-12T15:30:00Z", 3, 49.9, 18.2],
    ["2025-10-12T15:30:00Z", 1, 51.3, 19.7],
    ["2025-10-12T15:30:00Z", 2, 50.6, 18.9],
    ["2025-10-12T15:25:00Z", 3, 64.0, 31.6],
    ["2025-10-12T15:25:00Z", 1, 62.8, 29.8],
    ["2025-10-12T15:25:00Z", 2, 63.5, 30.7]
  ],
  "granularity": "minutes"
}
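
The data array is a flat list of [timestamp, object_id, normal_avg, ingest_avg] rows, so some post-processing is usually needed. As a minimal sketch (jq is assumed to be available; ISO 8601 timestamps sort chronologically, so max_by on the timestamp selects the newest row), the latest sample per CNode can be extracted like this:

curl -sku admin:******* "<the ad_hoc_query URL shown above>" \
| jq -r '.data
         | group_by(.[1])    # group rows per CNode id
         | .[]
         | max_by(.[0])      # newest timestamp wins
         | "cnode \(.[1]): normal=\(.[2])% ingest=\(.[3])%"'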

Prometheus Metrics (names)

Available via /api/prometheusmetrics/all:

vast_cnode_metrics_SchedulerMetrics_window_runtime_pct_ingest_only_silos_avg
vast_cnode_metrics_SchedulerMetrics_window_runtime_pct_normal_silos_avg
vast_cnode_metrics_SchedulerMetrics_window_runtime_ns_ingest_only_silos_{sum,count}
vast_cnode_metrics_SchedulerMetrics_window_runtime_ns_normal_silos_{sum,count}
vast_cnode_metrics_SchedulerMetrics_window_runtime_{sum,count}
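
To confirm these series are exposed on a given cluster before configuring a scrape job, the endpoint can be fetched and filtered directly. A minimal sketch, assuming the endpoint accepts the same credentials as the REST API example above:

curl -sku admin:******* "https://<vms_address>/api/prometheusmetrics/all" \
  | grep 'window_runtime_pct_.*_silos_avg'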