5. Observability and Metrics

The VAST Data platform offers a comprehensive set of performance and telemetry metrics that provide deep visibility into system behavior, workload performance, and Quality of Service (QoS) enforcement. These metrics are essential for monitoring infrastructure health, troubleshooting performance anomalies, validating SLAs, and enabling observability in multi-tenant environments. For example, they help detect bandwidth bottlenecks, track latency spikes, and identify “noisy neighbor” workloads. VAST metrics also support usage-based analytics and capacity forecasting, which are critical for optimizing resource allocation.

Cloud Service Providers (CSPs) can use VAST metrics as the foundation for delivering transparent, metered services to tenants. They can expose selected metrics, such as per-tenant IOPS, bandwidth, or latency, through customer-facing dashboards or API integrations. This supports tenant-level performance reporting and SLA validation while maintaining strict isolation and control. Metrics are collected across all system layers (CNodes, DNodes, switches) and are available in real time or over historical ranges via:

Prometheus Exporter: Exposes metrics in Prometheus/OpenMetrics format
REST API: Full access to raw and derived metrics
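
As a rough illustration of what consuming the Prometheus/OpenMetrics text format involves, the sketch below parses a small scrape payload into (name, labels, value) tuples. The payload is hypothetical: the metric name follows the naming used later in this section, but the label names, label values, and help text are placeholders, and the parser handles only the simple line shapes shown here (it does not unescape label values or read timestamps).

```python
import re

# Hypothetical scrape output; label names and values are placeholders.
SAMPLE = """\
# HELP vast_view_metrics_ViewMetrics_read_iops_count Illustrative help text
# TYPE vast_view_metrics_ViewMetrics_read_iops_count counter
vast_view_metrics_ViewMetrics_read_iops_count{cluster="c1",path="/tenant-a"} 1200
vast_view_metrics_ViewMetrics_read_iops_count{cluster="c1",path="/tenant-b"} 300
"""

# metric_name{optional labels} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label="value" pairs inside the braces
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

In practice a Prometheus server performs this parsing; the sketch is only meant to show the shape of the data behind the exporter.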

The diagram illustrates a system architecture in which the VMS (VAST Management System) aggregates metrics from the CNodes and DNodes, storing them in the metrics database that feeds the Prometheus exporter while also providing all collected metrics via the REST API. It also shows OpenMetrics data from the switches being processed before aggregation, with metrics exchanged between the nodes in a binary format. — API to VMS diagram

Note: Metrics are stored in the VAST Management System (VMS) database and aggregated at 5-minute intervals (or at 1-minute granularity in VAST 5.3+), supporting both internal monitoring and external service reporting.

Metrics Visualization

VAST provides two main tools for visualizing system metrics: the Web UI Dashboard and Grafana Dashboards. These tools enable administrators and cloud providers to monitor performance, detect anomalies, and manage tenant-level observability.

Web UI Dashboard

The VAST Web UI provides a real-time dashboard that displays key cluster metrics, including capacity usage, IOPS, bandwidth, and top-consuming users and views. It provides a high-level overview and enables dynamic sorting to quickly identify performance hotspots or imbalances.

Tenant Managers also have access to this dashboard, but visibility is limited to their own data. It shows per-tenant capacity, IOPS, bandwidth, and usage trends, supporting self-service monitoring in multi-tenant environments.

The screenshot displays the VAST overview dashboard, showing real-time capacity metrics such as data reduction (99.8%), performance indicators such as bandwidth (7,996 MB/sec) and IOPS (40,535), and storage inventory details including active CNodes, DNodes, NVRAMs, and SSDs. Graphs show how bandwidth, IOPS, and read/write latencies change over time in one-hour intervals. — VMS Dashboard

Grafana Dashboards

VAST provides a comprehensive suite of pre-built Grafana dashboards designed for deep observability and performance analysis. Key highlights include:

Version Compatibility: Works with VAST versions 5.1-sp40 and later using the built-in Prometheus exporter.
Easy Import: Dashboards are provided as .json files that can be directly imported into your Grafana instance.
Organized Views: Dashboards are organized by tenant, view, and node for targeted troubleshooting.
Use Cases: Ideal for real-time monitoring, historical analysis, QoS enforcement validation, and capacity planning.

These dashboards are production-ready and recommended as-is or as a reference for building custom visualizations. They help ensure consistent metric usage across VAST versions, reduce the chance of misinterpreting metric semantics, and simplify integration with external systems.

To use them, import the .json file, configure your Prometheus data source, and start visualizing metrics. Customized dashboards tailored to specific CSP use cases are also available upon request.

For more details, visit the VAST Grafana Dashboards repository.

The dashboard provides an overview of the cluster's health and performance metrics, including active connections, capacity used, and detailed I/O operations per second (IOPS) across different protocols such as NFSv3, S3, SMB, and NFSv4.

Key performance indicators include metadata IOPS, average bandwidth latency, read/write bandwidth, and read/write latencies, offering insights into system efficiency and resource utilization over time. — VAST Grafana Dashboard

Recommended Expressions on VAST

Purpose	PromQL Expression
Read IOPS	rate(vast_view_metrics_ViewMetrics_read_iops_count[5m])
Read Bandwidth	rate(vast_view_metrics_ViewMetrics_read_bw_sum[5m])
Read Latency	rate(vast_view_metrics_ViewMetrics_read_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_read_latency_count[5m])
Write IOPS	rate(vast_view_metrics_ViewMetrics_write_iops_count[5m])
Write Bandwidth	rate(vast_view_metrics_ViewMetrics_write_bw_sum[5m])
Write Latency	rate(vast_view_metrics_ViewMetrics_write_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_write_latency_count[5m])
QoS Throttling	rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum[5m]) / rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count[5m])
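
The latency and QoS expressions above all follow the same pattern: the rate of a cumulative `_sum` counter divided by the rate of the matching `_count` counter. The sketch below works through that arithmetic on two hypothetical scrapes. It is a simplification of what PromQL `rate()` does (no extrapolation to window boundaries), and the function names and sample values are illustrative only.

```python
def counter_rate(prev, curr, seconds):
    """Per-second increase of a monotonic counter between two scrapes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev
    if delta < 0:
        # Counter reset: assume it restarted from zero, as rate() does.
        delta = curr
    return delta / seconds


def mean_latency(sum_prev, sum_curr, count_prev, count_curr, seconds):
    """rate(sum) / rate(count): mean latency per operation in the window."""
    ops_rate = counter_rate(count_prev, count_curr, seconds)
    if ops_rate == 0:
        return None  # no operations in the window, so the mean is undefined
    return counter_rate(sum_prev, sum_curr, seconds) / ops_rate


# Hypothetical scrapes taken 300 s apart: the latency sum grew by 3.0 while
# 1,500 operations completed, so the mean is 3.0 / 1500 per operation.
print(mean_latency(10.0, 13.0, 1000, 2500, 300))
```

Because both rates share the same window, the window length cancels out; what matters is the ratio of the two counter increases.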

Derived Metrics (Version 5.3 and Later)

If PromQL is too complex or unsupported in your tooling, VAST offers derived metrics. These metrics are pre-computed periodic averages. They are less accurate over longer time windows, because averaging already-averaged values gives every period the same weight regardless of how much I/O it contained:

Purpose	Metric Name
Read IOPS	vast_view_metrics_ViewMetrics_read_iops_time_avg
Read Bandwidth	vast_view_metrics_ViewMetrics_read_bw_time_avg
Read Latency	vast_view_metrics_ViewMetrics_read_latency_avg
Write IOPS	vast_view_metrics_ViewMetrics_write_iops_time_avg
Write Bandwidth	vast_view_metrics_ViewMetrics_write_bw_time_avg
Write Latency	vast_view_metrics_ViewMetrics_write_latency_avg
QoS Throttling	vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg
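
To make the accuracy caveat for derived metrics concrete, the sketch below compares a plain average of per-window latency averages with an average weighted by operation count. The numbers are made up for illustration and the function names are hypothetical.

```python
def mean_of_window_means(windows):
    """Naive approach: average the per-window mean latencies."""
    return sum(mean for mean, _ in windows) / len(windows)


def op_weighted_mean(windows):
    """Weight each window's mean latency by the operations it contained."""
    total_ops = sum(ops for _, ops in windows)
    return sum(mean * ops for mean, ops in windows) / total_ops


# (mean latency in ms, operations) per window; illustrative values only.
windows = [(1.0, 9000), (10.0, 1000)]

print(mean_of_window_means(windows))  # 5.5
print(op_weighted_mean(windows))      # 1.9
```

The quiet window with slow operations pulls the naive average well above what most operations experienced. The counter-based PromQL expressions avoid this because `rate(sum) / rate(count)` is weighted by operation count by construction.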

Command line:

vastpy-cli --json get monitors/ad_hoc_query object_type=view time_frame=5m object_ids=3 prop_list=ViewMetrics,read_bw__time_avg prop_list=ViewMetrics,read_iops__time_avg prop_list=ViewMetrics,read_latency__avg
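
When the same query is issued for many views or property sets, it can help to assemble the argument list programmatically. The sketch below builds the command shown above from its parts; `build_ad_hoc_query` is a hypothetical helper, not part of vastpy, and it uses only the arguments that appear in the example (a single object ID is assumed).

```python
import shlex


def build_ad_hoc_query(object_id, props, object_type="view", time_frame="5m"):
    """Return the vastpy-cli argument list for an ad hoc metrics query."""
    args = [
        "vastpy-cli", "--json", "get", "monitors/ad_hoc_query",
        f"object_type={object_type}",
        f"time_frame={time_frame}",
        f"object_ids={object_id}",
    ]
    # One prop_list argument per requested property, as in the example above.
    args += [f"prop_list={prop}" for prop in props]
    return args


args = build_ad_hoc_query(
    3,
    [
        "ViewMetrics,read_bw__time_avg",
        "ViewMetrics,read_iops__time_avg",
        "ViewMetrics,read_latency__avg",
    ],
)
print(shlex.join(args))
```

The list form can be passed directly to `subprocess.run(args, capture_output=True)` without shell quoting concerns.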

Output format:

"prop_list": [
    "timestamp",
    "object_id",
    "ViewMetrics,read_bw__time_avg",
    "ViewMetrics,read_iops__time_avg",
    "ViewMetrics,read_latency__avg"
  ],

QoS Metrics Overview

Metrics / Concept	Description
vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg	Windowed mean time that requests in this view spent waiting on QoS budget during the scrape window. Indicates the presence and degree of QoS gating. The value is usually greater than 0 even without throttling, because it times a code section that is part of normal I/O processing.
vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum	Cumulative seconds of QoS wait accrued by the view (monotonic; use `rate()` for per-second).
vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count	Cumulative count of affected events included in the `*_sum`. Pair with `sum` to derive mean added latency per affected operation: `rate(sum)/rate(count)`.
vast_view_metrics_ViewMetrics_read_bw_avg vast_view_metrics_ViewMetrics_write_bw_avg	Windowed average delivered bandwidth for the view (bytes/s). It is useful to see if throughput is at or near the configured QoS cap. *
vast_view_metrics_ViewMetrics_read_iops_time_avg vast_view_metrics_ViewMetrics_write_iops_time_avg	Windowed average IOPS for the view (ops/s). Helps separate small-IO vs. streaming patterns. *
vast_user_read_bw vast_user_write_bw	Per-user windowed average bandwidth (bytes/s). Complements view-level utilization.
vast_user_read_iops vast_user_write_iops	Per-user windowed average IOPS (ops/s).
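
One way to combine these metrics is to flag a view only when it is both close to its bandwidth cap and showing a meaningful QoS wait, since the wait metric alone is usually non-zero. The sketch below is a minimal illustration under that assumption; the function name, the 90% ratio, and the 1 ms wait threshold are hypothetical and should be tuned against real data, and the configured cap must be supplied by the caller.

```python
def qos_status(bw_bytes_per_s, cap_bytes_per_s, wait_mean_s,
               near_cap_ratio=0.9, wait_threshold_s=0.001):
    """Classify a view from its delivered bandwidth and mean QoS wait.

    near_cap_ratio and wait_threshold_s are illustrative defaults,
    not values recommended by VAST.
    """
    near_cap = (cap_bytes_per_s > 0
                and bw_bytes_per_s >= near_cap_ratio * cap_bytes_per_s)
    waiting = wait_mean_s >= wait_threshold_s
    if near_cap and waiting:
        return "throttled"
    if near_cap:
        return "near-cap"
    if waiting:
        return "waiting-below-cap"
    return "ok"


# Hypothetical view with a 1 GB/s cap delivering 950 MB/s with 5 ms mean wait.
print(qos_status(950e6, 1e9, 0.005))
```

Using a threshold rather than a strict `> 0` test on the wait time avoids flagging every view that merely passes through the QoS code path.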

Notes:

Window length: the averaging window for “*_avg” metrics equals your Prometheus scrape_interval (e.g., 15s), unless configured differently in prometheus.yml.
Scopes & Endpoints:
- /api/prometheusmetrics/views → per-view (QoS, performance, etc.)
- /api/prometheusmetrics/users → per-user (bandwidth, IOPS, etc.)
HELP/TYPE lines: Each metric family carries # HELP <metric> <description> and # TYPE <metric> <type> lines. Treat these as the authoritative contract for your cluster/build.
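
Since the HELP/TYPE lines are the contract for a given build, it can be useful to extract them from a scrape and check, for example, that a metric used with `rate()` is actually typed as a counter. The sketch below collects them into a dictionary. The sample payload is a placeholder: the help text, the `gauge` type, and the label value are assumptions, not output from a real cluster.

```python
def parse_metadata(text):
    """Collect # HELP and # TYPE lines into {metric: {"help": ..., "type": ...}}."""
    meta = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) < 3 or parts[0] != "#" or parts[1] not in ("HELP", "TYPE"):
            continue  # sample line, blank line, or other comment
        kind, metric = parts[1], parts[2]
        rest = parts[3] if len(parts) == 4 else ""
        meta.setdefault(metric, {})[kind.lower()] = rest
    return meta


# Placeholder payload in the shape served by the per-user endpoint.
payload = """\
# HELP vast_user_read_bw Placeholder description
# TYPE vast_user_read_bw gauge
vast_user_read_bw{identifier="1001"} 5.0
"""

print(parse_metadata(payload))
```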

Tracking Capacity Usage per User (UID)

VAST supports tracking storage usage by individual users (UIDs) through user-aware quotas. This eliminates the need for customers to walk through the entire view structure to manually calculate per-user usage.

You can enable user capacity tracking on any directory-level quota — such as those automatically created by the CSI driver — without needing per-user definitions or hard limits. Once enabled, VAST exports per-UID usage metrics via Prometheus, which can be visualized in Grafana.

Quick Setup via Web UI

Step 1 – Create or Edit a Quota (No Limits Required):

Navigate to Settings → Element Store → Quotas.
Create or edit a directory-level quota.
(Optional) Leave soft/hard limits blank for tracking-only quotas.
Enable the toggle: “User/Group Quotas”.
Under the Default User Rule, set limits to 0 to avoid enforcement.
Click Update.

Step 2 – Monitor Prometheus Capacity Metrics Per-UID

Once user tracking is enabled, the following metrics are exported via Prometheus:

vast_user_quota_used_capacity{cluster="<cluster>", identifier="<uid>", path="<view>"}

Reports the logical space used (in bytes) per user and per directory
Can be grouped by identifier (UID) or path
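
Grouping by identifier amounts to summing each user's usage across all tracked directories, which in PromQL would be `sum by (identifier) (vast_user_quota_used_capacity)`. The sketch below shows the same aggregation in plain code for cases where the samples are processed outside Prometheus; the function name, UIDs, paths, and byte counts are made up for illustration.

```python
from collections import defaultdict


def usage_by_uid(samples):
    """Sum used bytes per UID and return (uid, bytes) pairs, largest first."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["identifier"]] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# (labels, used bytes) pairs as they might be parsed from
# vast_user_quota_used_capacity; all values are illustrative.
samples = [
    ({"identifier": "1001", "path": "/projects/a"}, 10.0),
    ({"identifier": "1002", "path": "/projects/a"}, 30.0),
    ({"identifier": "1001", "path": "/projects/b"}, 5.0),
]

print(usage_by_uid(samples))  # [('1002', 30.0), ('1001', 15.0)]
```

Swapping `labels["identifier"]` for `labels["path"]` gives the per-directory view instead.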

Grafana Dashboard Reference

Use or customize VAST’s official Grafana dashboards to visualize UID usage:

Repository: vast-data/vast-grafana-dashboards
Recommended Dashboard: Top Actors – Users

Client-Side Observability (NFS only)

VAST's vNFS Collector is an open-source tool that provides deep visibility into NFS workloads by capturing detailed I/O metrics for every NFS mount. It tracks per-operation counters for all key NFSv3 and NFSv4 commands, including READ, WRITE, LOOKUP, and DELETE, along with contextual metadata such as mount points, process names, user IDs, and environment variables like SLURM_JOB_ID. This rich dataset enables accurate workload profiling and performance tuning.

The collector supports flexible data forwarding, with local JSON logging and seamless integration into Prometheus (for Grafana dashboards), Kafka (for event-driven pipelines), and the VAST DataBase (for historical analytics via Trino, Spark, and Grafana).

For more information, visit: