5. Observability and Metrics

The VAST Data platform offers a comprehensive set of performance and telemetry metrics that provide deep visibility into system behavior, workload performance, and Quality of Service (QoS) enforcement. These metrics are essential for monitoring infrastructure health, troubleshooting performance anomalies, validating SLAs, and enabling observability in multi-tenant environments. For example, they help detect bandwidth bottlenecks, track latency spikes, and identify “noisy neighbor” workloads. VAST metrics also support usage-based analytics and capacity forecasting, which are critical for optimizing resource allocation.

Cloud Service Providers (CSPs) can use VAST metrics as the foundation for delivering transparent, metered services to tenants. CSPs can expose selected metrics, such as per-tenant IOPS, bandwidth, or latency, through customer-facing dashboards or API integrations. This supports tenant-level performance reporting and SLA validation while maintaining strict isolation and control. Metrics are collected across all system layers (CNodes, DNodes, switches) and are available in real time or over historical ranges via:

  • Prometheus Exporter: Exposes metrics in Prometheus/OpenMetrics format

  • REST API: Full access to raw and derived metrics

The diagram illustrates a system architecture in which the VAST Management System (VMS) aggregates metrics from the CNodes and DNodes, stores them in the metrics database used by the Prometheus exporter, and provides all collected metrics via the REST API. It also shows switch metrics arriving in OpenMetrics format before aggregation, while the nodes exchange metrics in a binary format.

API to VMS diagram

Note: Metrics are stored in the VAST Management System (VMS) database and aggregated at 5-minute intervals (or at 1-minute granularity in VAST 5.3+), supporting both internal monitoring and external service reporting.

Metrics Visualization

VAST provides two main tools for visualizing system metrics: the Web UI Dashboard and Grafana Dashboards. These tools enable administrators and cloud providers to monitor performance, detect anomalies, and manage tenant-level observability.

Web UI Dashboard

The VAST Web UI provides a real-time dashboard that displays key cluster metrics, including capacity usage, IOPS, bandwidth, and top-consuming users and views. It provides a high-level overview and enables dynamic sorting to quickly identify performance hotspots or imbalances.

Tenant Managers also have access to this dashboard, but visibility is limited to their own data. It shows per-tenant capacity, IOPS, bandwidth, and usage trends, supporting self-service monitoring in multi-tenant environments.

The screenshot displays an overview dashboard from VAST, showing real-time capacity metrics such as data reduction (99.8%), performance indicators such as bandwidth at 7,996 MB/sec and IOPS at 40,535, and storage inventory details including active CNodes, DNodes, NVRAMs, and SSDs. Graphs show how bandwidth, IOPS, and read/write latencies fluctuate over one-hour intervals.

VMS Dashboard

Grafana Dashboards

VAST provides a comprehensive suite of pre-built Grafana dashboards designed for deep observability and performance analysis. Key highlights include:

  • Version Compatibility: Works with VAST versions 5.1-sp40 and later using the built-in Prometheus exporter.

  • Easy Import: Dashboards are provided as .json files that can be directly imported into your Grafana instance.

  • Organized Views: Dashboards are organized by tenant, view, and node for targeted troubleshooting.

  • Use Cases: Ideal for real-time monitoring, historical analysis, QoS enforcement validation, and capacity planning.

These dashboards are production-ready and recommended as-is or as a reference for building custom visualizations. They help ensure consistent metric usage across VAST versions, reduce the chance of misinterpreting metric semantics, and simplify integration with external systems.

To use them, import the .json file, configure your Prometheus data source, and start visualizing metrics. Customized dashboards tailored to specific CSP use cases are also available upon request.

For more details, visit the VAST Grafana Dashboards repository.

The dashboard provides an overview of the cluster's health and performance metrics, including active connections, capacity used, and detailed I/O operations per second (IOPS) across different protocols such as NFSv3, S3, SMB, and NFSv4.

Key performance indicators include metadata IOPS, average bandwidth latency, read/write bandwidth, and read/write latencies, offering insights into system efficiency and resource utilization over time.

VAST Grafana Dashboard

Recommended Expressions on VAST

| Purpose | PromQL Expression |
|---|---|
| Read IOPS | rate(vast_view_metrics_ViewMetrics_read_iops_count[5m]) |
| Read Bandwidth | rate(vast_view_metrics_ViewMetrics_read_bw_sum[5m]) |
| Read Latency | rate(vast_view_metrics_ViewMetrics_read_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_read_latency_count[5m]) |
| Write IOPS | rate(vast_view_metrics_ViewMetrics_write_iops_count[5m]) |
| Write Bandwidth | rate(vast_view_metrics_ViewMetrics_write_bw_sum[5m]) |
| Write Latency | rate(vast_view_metrics_ViewMetrics_write_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_write_latency_count[5m]) |
| QoS Throttling | rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum[5m]) / rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count[5m]) |
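The latency and QoS throttling expressions divide the rate of a _sum counter by the rate of the matching _count counter. A minimal Python sketch of that arithmetic follows; the sample values are invented rather than real cluster output, and counter resets (which PromQL's rate() handles for you) are ignored.

```python
def mean_per_op(sum_start, sum_end, count_start, count_end):
    """Mean value per operation over a window, from two cumulative samples.

    Mirrors rate(x_sum[w]) / rate(x_count[w]): the window length cancels
    out, so only the counter deltas matter. Counter resets are not handled.
    """
    ops = count_end - count_start
    if ops <= 0:
        return None  # no operations in the window; PromQL would yield NaN
    return (sum_end - sum_start) / ops


# Hypothetical samples taken 5 minutes apart for one view.
latency = mean_per_op(sum_start=120.0, sum_end=126.0,
                      count_start=40_000, count_end=52_000)
print(latency)  # 0.0005, in whatever unit the _sum counter uses
```

Because the result depends only on the deltas, the same mean comes out for any window length that covers the same two samples.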

Derived Metrics (from Version 5.3 and higher)

If PromQL is too complex or unsupported, VAST offers derived metrics. These metrics are based on periodic averages and are less accurate over longer time windows due to the averaging characteristics:

| Purpose | Metric Name |
|---|---|
| Read IOPS | vast_view_metrics_ViewMetrics_read_iops_time_avg |
| Read Bandwidth | vast_view_metrics_ViewMetrics_read_bw_time_avg |
| Read Latency | vast_view_metrics_ViewMetrics_read_latency_avg |
| Write IOPS | vast_view_metrics_ViewMetrics_write_iops_time_avg |
| Write Bandwidth | vast_view_metrics_ViewMetrics_write_bw_time_avg |
| Write Latency | vast_view_metrics_ViewMetrics_write_latency_avg |
| QoS Throttling | vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg |
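One plausible reading of the accuracy caveat for derived metrics: a plain mean of per-interval averages gives a quiet interval the same weight as a busy one. The sketch below uses invented numbers to compare that with the operation-weighted mean that the sum/count counters yield.

```python
# Each tuple: (operations in the interval, mean latency in that interval, ms).
# Values are invented for illustration.
intervals = [(100, 1.0), (100, 1.0), (10_000, 5.0)]

# Mean of the per-interval averages: every interval counts equally.
mean_of_avgs = sum(lat for _, lat in intervals) / len(intervals)

# Operation-weighted mean: equivalent to total latency / total operations.
total_ops = sum(ops for ops, _ in intervals)
weighted = sum(ops * lat for ops, lat in intervals) / total_ops

print(round(mean_of_avgs, 3))  # 2.333
print(round(weighted, 3))      # 4.922
```

The gap between the two numbers grows as load varies more across the intervals being combined, which is why the counter-based PromQL expressions are preferable for long ranges.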

Command line:

vastpy-cli --json get monitors/ad_hoc_query object_type=view time_frame=5m object_ids=3 prop_list=ViewMetrics,read_bw__time_avg prop_list=ViewMetrics,read_iops__time_avg prop_list=ViewMetrics,read_latency__avg

Output format:

"prop_list": [
    "timestamp",
    "object_id",
    "ViewMetrics,read_bw__time_avg",
    "ViewMetrics,read_iops__time_avg",
    "ViewMetrics,read_latency__avg"
  ],
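A short Python sketch of how the JSON output could be turned into labelled rows by pairing each row with the names in prop_list. Only prop_list appears in the output shown above; the "data" key, the row layout, and every value in the sample are assumptions for illustration, so check them against the response from your own cluster.

```python
import json


def rows_from_query(payload):
    """Pair each data row with the column names listed in prop_list."""
    columns = payload["prop_list"]
    return [dict(zip(columns, row)) for row in payload.get("data", [])]


# Hypothetical response: the "data" key and all values are assumptions.
raw = """
{
  "prop_list": ["timestamp", "object_id",
                "ViewMetrics,read_bw__time_avg",
                "ViewMetrics,read_iops__time_avg",
                "ViewMetrics,read_latency__avg"],
  "data": [["2025-01-01T00:00:00Z", 3, 1048576.0, 250.0, 420.0]]
}
"""
rows = rows_from_query(json.loads(raw))
print(rows[0]["ViewMetrics,read_iops__time_avg"])  # 250.0
```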

QoS Metrics Overview

| Metrics / Concept | Description |
|---|---|
| vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg | Windowed mean time that requests in this view spent waiting on QoS budget during the scrape window. Indicates the presence and degree of QoS gating. Usually greater than 0, because it times a code section that is part of normal IO processing. |
| vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum | Cumulative seconds of QoS wait accrued by the view (monotonic; use rate() for a per-second value). |
| vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count | Cumulative count of affected events included in the *_sum. Pair with the sum to derive the mean added latency per affected operation: rate(sum)/rate(count). |
| vast_view_metrics_ViewMetrics_read_bw_avg, vast_view_metrics_ViewMetrics_write_bw_avg | Windowed average delivered bandwidth for the view (bytes/s). Useful for seeing whether throughput is at or near the configured QoS cap. * |
| vast_view_metrics_ViewMetrics_read_iops_time_avg, vast_view_metrics_ViewMetrics_write_iops_time_avg | Windowed average IOPS for the view (ops/s). Helps separate small-IO from streaming patterns. * |
| vast_user_read_bw, vast_user_write_bw | Per-user windowed average bandwidth (bytes/s). Complements view-level utilization. |
| vast_user_read_iops, vast_user_write_iops | Per-user windowed average IOPS (ops/s). |

Notes:

  1. The averaging window for “*_avg” metrics equals your Prometheus scrape_interval (e.g., 15s), unless configured differently in prometheus.yml.

  2. Scopes & Endpoints:

    • /api/prometheusmetrics/views → per-view (QoS, performance, etc.)

    • /api/prometheusmetrics/users → per-user (bandwidth, IOPS, etc.)

  3. HELP/TYPE lines: Each series carries # HELP <metric> <description> and # TYPE <metric> <type>. Treat these as the authoritative contract for your cluster/build.
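To illustrate note 3, the following Python sketch collects the HELP and TYPE metadata from a block of Prometheus text-format output. The sample text is hypothetical: the description and the gauge type shown here are placeholders, not what a real cluster reports.

```python
def read_metric_contract(exposition_text):
    """Collect HELP and TYPE metadata per metric from Prometheus text format."""
    contract = {}
    for line in exposition_text.splitlines():
        # Metadata lines look like: "# HELP <metric> <description>"
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            _, kind, name, text = parts
            contract.setdefault(name, {})[kind.lower()] = text
    return contract


# Hypothetical scrape output; description and type are placeholders.
sample = """\
# HELP vast_user_read_bw placeholder description
# TYPE vast_user_read_bw gauge
vast_user_read_bw{identifier="1001"} 1048576
"""
contract = read_metric_contract(sample)
print(contract["vast_user_read_bw"]["type"])  # gauge
```

Running this against the text returned by the per-view or per-user endpoint on your own build shows the actual types and descriptions to rely on.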

Tracking Capacity Usage per User (UID)

VAST supports tracking storage usage by individual users (UIDs) through user-aware quotas. This eliminates the need for customers to walk through the entire view structure to manually calculate per-user usage.

You can enable user capacity tracking on any directory-level quota — such as those automatically created by the CSI driver — without needing per-user definitions or hard limits. Once enabled, VAST exports per-UID usage metrics via Prometheus, which can be visualized in Grafana.

Quick Setup via Web UI

Step 1 – Create or Edit a Quota (No Limits Required):

  • Navigate to Settings → Element Store → Quotas.

  • Create or edit a directory-level quota.

  • (Optional) Leave soft/hard limits blank for tracking-only quotas.

  • Enable the toggle: “User/Group Quotas”.

  • Under the Default User Rule, set limits to 0 to avoid enforcement.

  • Click Update.

Step 2 – Monitor Prometheus Capacity Metrics Per-UID

Once user tracking is enabled, the following metric is exported via Prometheus:

vast_user_quota_used_capacity{cluster="<cluster>", identifier="<uid>", path="<view>"}

  • Reports the logical space used (in bytes) per user and per directory

  • Can be grouped by identifier (UID) or path
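Grouping by identifier can be done in PromQL with an aggregation such as sum by (identifier) (vast_user_quota_used_capacity). The Python sketch below shows the same grouping on already-fetched samples; the UIDs, paths, and byte counts are invented for illustration.

```python
from collections import defaultdict


def usage_by_uid(samples):
    """Sum used bytes per UID across all quota paths."""
    totals = defaultdict(int)
    for labels, used_bytes in samples:
        totals[labels["identifier"]] += used_bytes
    return dict(totals)


# Hypothetical samples: (label set, value in bytes) per user and directory.
samples = [
    ({"identifier": "1001", "path": "/projects/a"}, 5 * 2**30),
    ({"identifier": "1001", "path": "/projects/b"}, 3 * 2**30),
    ({"identifier": "1002", "path": "/projects/a"}, 1 * 2**30),
]
totals = usage_by_uid(samples)
for uid, used in sorted(totals.items()):
    print(uid, used / 2**30, "GiB")
```

Swapping labels["identifier"] for labels["path"] gives per-directory totals instead.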

Grafana Dashboard Reference

Use or customize VAST’s official Grafana dashboards to visualize UID usage.

Client-Side Observability (NFS only)

VAST's vNFS Collector is an open-source tool that provides deep visibility into NFS workloads by capturing detailed I/O metrics for every NFS mount. It tracks per-operation counters for all key NFSv3 and NFSv4 commands, including READ, WRITE, LOOKUP, and DELETE, along with contextual metadata such as mount points, process names, user IDs, and environment variables like SLURM_JOB_ID. This rich dataset enables accurate workload profiling and performance tuning.

The collector supports flexible data forwarding, with local JSON logging and seamless integration into Prometheus (for Grafana dashboards), Kafka (for event-driven pipelines), and the VAST DataBase (for historical analytics via Trino, Spark, and Grafana).


For more information, visit: