5. Observability and Metrics

The VAST Data platform offers a comprehensive set of performance and telemetry metrics that provide deep visibility into system behavior, workload performance, and Quality of Service (QoS) enforcement. These metrics are essential for monitoring infrastructure health, troubleshooting performance anomalies, validating SLAs, and enabling observability in multi-tenant environments. For example, they help detect bandwidth bottlenecks, track latency spikes, and identify “noisy neighbor” workloads. VAST metrics also support usage-based analytics and capacity forecasting, which are critical for optimizing resource allocation.

Cloud Service Providers (CSPs) can use VAST metrics as the foundation for delivering transparent, metered services to tenants. They can expose selected metrics, such as per-tenant IOPS, bandwidth, or latency, via customer-facing dashboards or API integrations. This supports tenant-level performance reporting and SLA validation while maintaining strict isolation and control. Metrics are collected across all system layers (CNodes, DNodes, switches) and are available in real time or over historical ranges via:

  • Prometheus Exporter: Exposes metrics in Prometheus/OpenMetrics format

  • REST API: Full access to raw and derived metrics

The diagram illustrates a system architecture in which the VMS (VAST Management System) aggregates metrics from the CNodes and DNodes, stores them in the metrics database used by the Prometheus exporter, and also provides all collected metrics via the REST API. It additionally shows OpenMetrics data being processed by a switch before aggregation, with binary-format metrics exchanged between the component nodes.

API to VMS diagram

Note: Metrics are stored in the VAST Management System (VMS) database and aggregated at 5-minute intervals (or at 1-minute granularity in VAST 5.3+), supporting both internal monitoring and external service reporting.

Metrics Visualization

VAST provides two main tools for visualizing system metrics: the Web UI Dashboard and Grafana Dashboards. These tools enable administrators and cloud providers to monitor performance, detect anomalies, and manage tenant-level observability.

Web UI Dashboard

The VAST Web UI provides a real-time dashboard that displays key cluster metrics, including capacity usage, IOPS, bandwidth, and top-consuming users and views. It provides a high-level overview and enables dynamic sorting to quickly identify performance hotspots or imbalances.

Tenant Managers also have access to this dashboard, but visibility is limited to their own data. It shows per-tenant capacity, IOPS, bandwidth, and usage trends, supporting self-service monitoring in multi-tenant environments.

The screenshot displays an overview dashboard from VAST, showing real-time capacity metrics such as data reduction (99.8%), performance indicators such as bandwidth at 7,996 MB/sec and IOPS at 40,535, and storage inventory details including active CNodes, DNodes, NVRAMs, and SSDs. Graphs show fluctuations in bandwidth, IOPS, and read/write latencies over one-hour intervals.

VMS Dashboard

Grafana Dashboards

VAST provides a comprehensive suite of pre-built Grafana dashboards designed for deep observability and performance analysis. Key highlights include:

  • Version Compatibility: Works with VAST versions 5.1-sp40 and later using the built-in Prometheus exporter.

  • Easy Import: Dashboards are provided as .json files that can be directly imported into your Grafana instance.

  • Organized Views: Dashboards are organized by tenant, view, and node for targeted troubleshooting.

  • Use Cases: Ideal for real-time monitoring, historical analysis, QoS enforcement validation, and capacity planning.

These dashboards are production-ready and recommended as-is or as a reference for building custom visualizations. They help ensure consistent metric usage across VAST versions, reduce the chance of misinterpreting metric semantics, and simplify integration with external systems.

To use them, import the .json file, configure your Prometheus data source, and start visualizing metrics. Customized dashboards tailored to specific CSP use cases are also available upon request.

For more details, visit the VAST Grafana Dashboards repository.

The dashboard provides an overview of the cluster's health and performance metrics, including active connections, capacity used, and detailed I/O operations per second (IOPS) across different protocols such as NFSv3, S3, SMB, and NFSv4.

Key performance indicators include metadata IOPS, average bandwidth latency, read/write bandwidth, and read/write latencies, offering insights into system efficiency and resource utilization over time.

VAST Grafana Dashboard

Purpose          PromQL Expression
Read IOPS        rate(vast_view_metrics_ViewMetrics_read_iops_count[5m])
Read Bandwidth   rate(vast_view_metrics_ViewMetrics_read_bw_sum[5m])
Read Latency     rate(vast_view_metrics_ViewMetrics_read_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_read_latency_count[5m])
Write IOPS       rate(vast_view_metrics_ViewMetrics_write_iops_count[5m])
Write Bandwidth  rate(vast_view_metrics_ViewMetrics_write_bw_sum[5m])
Write Latency    rate(vast_view_metrics_ViewMetrics_write_latency_sum[5m]) / rate(vast_view_metrics_ViewMetrics_write_latency_count[5m])
QoS Throttling   rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum[5m]) / rate(vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count[5m])
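To make the ratio-of-rates pattern concrete, the sketch below reproduces in Python what an expression of the form rate(..._sum[5m]) / rate(..._count[5m]) computes from two scrapes of a cumulative counter pair. The CounterSample type and all sample values are hypothetical, units are left unspecified, and the sketch ignores counter resets and the window-edge extrapolation that Prometheus applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One scrape of a cumulative *_sum / *_count pair (hypothetical values)."""
    timestamp: float    # seconds
    total_sum: float    # cumulative sum, e.g. accumulated latency
    total_count: float  # cumulative number of operations


def per_second_rate(first: float, last: float, elapsed: float) -> float:
    """Simplified rate(): increase of a monotonic counter per elapsed second."""
    if elapsed <= 0:
        raise ValueError("samples must be ordered in time")
    return (last - first) / elapsed


def mean_per_operation(first: CounterSample, last: CounterSample) -> float:
    """rate(sum) / rate(count): mean value per operation inside the window."""
    elapsed = last.timestamp - first.timestamp
    count_rate = per_second_rate(first.total_count, last.total_count, elapsed)
    if count_rate == 0:
        return 0.0  # no operations in the window; PromQL would yield NaN here
    sum_rate = per_second_rate(first.total_sum, last.total_sum, elapsed)
    return sum_rate / count_rate


# Two scrapes taken 300 seconds (5 minutes) apart.
start = CounterSample(timestamp=0.0, total_sum=1_000.0, total_count=500.0)
end = CounterSample(timestamp=300.0, total_sum=4_000.0, total_count=2_000.0)

iops = per_second_rate(start.total_count, end.total_count, 300.0)
latency = mean_per_operation(start, end)
print(f"operations per second: {iops:.1f}, mean per operation: {latency:.1f}")
```

Because both rates cover the same window, the elapsed time cancels out and the ratio reduces to the increase in the sum divided by the increase in the count.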

Derived Metrics (from Version 5.3 and higher)

If PromQL is too complex or unsupported, VAST offers derived metrics. These metrics are based on periodic averages and are less accurate over longer time windows due to the averaging characteristics:

Purpose          Metric Name
Read IOPS        vast_view_metrics_ViewMetrics_read_iops_time_avg
Read Bandwidth   vast_view_metrics_ViewMetrics_read_bw_sum_time_avg
Read Latency     vast_view_metrics_ViewMetrics_read_latency_avg
Write IOPS       vast_view_metrics_ViewMetrics_write_iops_time_avg
Write Bandwidth  vast_view_metrics_ViewMetrics_write_bw_time_avg
Write Latency    vast_view_metrics_ViewMetrics_write_latency_avg
QoS Throttling   vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg
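One possible reading of the accuracy caveat is that averaging per-interval averages weights every interval equally, whereas a counter-based ratio weights every operation equally. The sketch below uses made-up per-interval figures to show the two results diverging under uneven load. It illustrates the general arithmetic only and is not a description of how VAST computes the derived metrics.

```python
# Hypothetical per-interval data: (operations completed, mean latency in that interval)
intervals = [
    (10_000, 1.0),  # busy interval, low latency
    (100, 9.0),     # nearly idle interval, high latency
]

# Mean of the per-interval averages: every interval counts equally.
mean_of_averages = sum(latency for _, latency in intervals) / len(intervals)

# Operation-weighted mean: the same as total latency divided by total operations,
# which is what a sum/count ratio over the whole range would report.
total_ops = sum(ops for ops, _ in intervals)
weighted_mean = sum(ops * latency for ops, latency in intervals) / total_ops

print(f"mean of interval averages: {mean_of_averages:.3f}")
print(f"operation-weighted mean:   {weighted_mean:.3f}")
```

The longer the queried range, the more intervals with different load levels are mixed, so the gap between the two figures can grow.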

Command line:

vastpy-cli --json get monitors/ad_hoc_query object_type=view time_frame=5m object_ids=3 prop_list=ViewMetrics,read_bw__time_avg prop_list=ViewMetrics,read_iops__time_avg prop_list=ViewMetrics,read_latency__avg

Output format:

"prop_list": [
    "timestamp",
    "object_id",
    "ViewMetrics,read_bw__time_avg",
    "ViewMetrics,read_iops__time_avg",
    "ViewMetrics,read_latency__avg"
  ],

QoS Metrics Overview

Each entry lists the metric or concept, followed by its description:

  • vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_avg: Windowed mean time that requests in this view spent waiting on QoS budget during the scrape window. Indicates the presence and degree of QoS gating. The value is usually greater than 0 because it measures the time taken by a code section that is part of normal I/O processing.

  • vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_sum: Cumulative seconds of QoS wait accrued by the view (monotonic; use rate() for a per-second value).

  • vast_view_metrics_ViewMetrics_qos_wait_for_budget_time_count: Cumulative count of affected events included in *_sum. Pair it with *_sum to derive the mean added latency per affected operation: rate(sum)/rate(count).

  • vast_view_metrics_ViewMetrics_read_bw_avg, vast_view_metrics_ViewMetrics_write_bw_avg: Windowed average delivered bandwidth for the view (bytes/s). Useful for seeing whether throughput is at or near the configured QoS cap. *

  • vast_view_metrics_ViewMetrics_read_iops_time_avg, vast_view_metrics_ViewMetrics_write_iops_time_avg: Windowed average IOPS for the view (ops/s). Helps separate small-I/O patterns from streaming patterns. *

  • vast_user_read_bw, vast_user_write_bw: Per-user windowed average bandwidth (bytes/s). Complements view-level utilization.

  • vast_user_read_iops, vast_user_write_iops: Per-user windowed average IOPS (ops/s).

Notes:

  1. The averaging window for “*_avg” metrics equals your Prometheus scrape_interval (e.g., 15s), unless configured differently in prometheus.yml.

  2. Scopes & Endpoints:

    • /api/prometheusmetrics/views → per-view (QoS, performance, etc.)

    • /api/prometheusmetrics/users → per-user (bandwidth, IOPS, etc.)

  3. HELP/TYPE lines: Each series carries # HELP <metric> <description> and # TYPE <metric> <type>. Treat these as the authoritative contract for your cluster/build.
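Since the HELP/TYPE lines act as the contract for a given build, it can be useful to extract them programmatically. The following is a minimal sketch of such a parser for the Prometheus text exposition format. The function and type names are illustrative, and in the sample text only the HELP lines appear in this document; the TYPE line and the sample value are assumptions made up for the demonstration.

```python
from typing import Dict, NamedTuple


class MetricInfo(NamedTuple):
    help_text: str
    metric_type: str


def parse_metadata(exposition: str) -> Dict[str, MetricInfo]:
    """Collect # HELP and # TYPE comments from Prometheus text exposition output."""
    help_texts: Dict[str, str] = {}
    types: Dict[str, str] = {}
    for line in exposition.splitlines():
        # Expected shape: "#", keyword, metric name, remainder of the line
        parts = line.strip().split(None, 3)
        if len(parts) < 3 or parts[0] != "#":
            continue
        keyword, name = parts[1], parts[2]
        rest = parts[3] if len(parts) == 4 else ""
        if keyword == "HELP":
            help_texts[name] = rest
        elif keyword == "TYPE":
            types[name] = rest
    names = sorted(set(help_texts) | set(types))
    # Metrics without a TYPE line are reported as "untyped".
    return {n: MetricInfo(help_texts.get(n, ""), types.get(n, "untyped")) for n in names}


sample = """\
# HELP vast_view_logical_capacity View Logical Capacity
# TYPE vast_view_logical_capacity gauge
vast_view_logical_capacity 1024
# HELP vast_view_physical_capacity View Physical Capacity
"""

metadata = parse_metadata(sample)
for name, info in metadata.items():
    print(f"{name}: {info.metric_type} - {info.help_text}")
```

Comparing this output between cluster versions is one way to notice renamed metrics or changed types before they affect dashboards.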

Capacity Usage Monitoring

For billing and accurate tenant metering, VAST recommends using quota-based capacity tracking. Quotas provide the most accurate accounting model because usage is tracked directly at the quota path level and can be aggregated per tenant.

Quota-based accounting exposes both:

  • Logical capacity (used_capacity)

  • Effective capacity (used_effective_capacity)

Quota-Based Capacity Tracking

Capacity usage can be retrieved using the /quotas REST API endpoint:

curl -sku admin:****** \
"https://vast-file-server-vms/api/latest/quotas/?tenant_name=tenant-a" | \
jq '.[] | {
  name,
  tenant_name,
  path,
  used_capacity,
  used_effective_capacity,
  used_capacity_tb,
  used_effective_capacity_tb
}'

Example response:

{
  "name": "datasets-quota",
  "tenant_name": "tenant-a",
  "path": "/",
  "used_capacity": 1511299156932695,
  "used_effective_capacity": 1511299156932695,
  "used_capacity_tb": 1374.519,
  "used_effective_capacity_tb": 1374.519
}

Note: Multiple quota paths can be aggregated to calculate total tenant capacity usage.
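The aggregation mentioned in the note can be sketched as follows, using the field names from the example response. The quota records are hypothetical, and the sketch assumes the quota paths do not overlap; nested quota paths would presumably be counted twice by a plain sum. The divisor 2^40 is inferred from the example response, where used_capacity divided by 2^40 matches used_capacity_tb.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping

BYTES_PER_TIB = 2 ** 40  # ratio between used_capacity and used_capacity_tb in the example


def usage_by_tenant(quotas: Iterable[Mapping[str, Any]],
                    field: str = "used_capacity") -> Dict[str, int]:
    """Sum one capacity field (in bytes) across all quota records of each tenant."""
    totals: Dict[str, int] = defaultdict(int)
    for quota in quotas:
        totals[str(quota["tenant_name"])] += int(quota[field])
    return dict(totals)


# Hypothetical records shaped like the jq output above, with non-overlapping paths.
quotas = [
    {"name": "datasets-quota", "tenant_name": "tenant-a", "path": "/datasets",
     "used_capacity": 1511299156932695, "used_effective_capacity": 1511299156932695},
    {"name": "scratch-quota", "tenant_name": "tenant-a", "path": "/scratch",
     "used_capacity": 1099511627776, "used_effective_capacity": 549755813888},
]

logical = usage_by_tenant(quotas)
effective = usage_by_tenant(quotas, field="used_effective_capacity")
for tenant, used_bytes in logical.items():
    print(f"{tenant}: {used_bytes / BYTES_PER_TIB:.3f} logical, "
          f"{effective[tenant] / BYTES_PER_TIB:.3f} effective")
```

Summing the raw byte fields and converting once at the end avoids accumulating rounding error from the pre-rounded *_tb values.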

Tenant-Level Capacity Monitoring (Prometheus)

The /prometheusmetrics/tenants endpoint exposes tenant-level logical capacity metrics that CSPs can use for tenant monitoring, dashboards, and billing workflows.

Example query:

curl -sku admin:******** \
"https://vast-file-server-vms/api/latest/prometheusmetrics/tenants"

Example metrics:

vast_tenant_metrics_TenantMetrics_logical_capacity_avg
vast_tenant_metrics_TenantMetrics_logical_capacity_sum
vast_tenant_metrics_TenantMetrics_logical_capacity_count

Note: Tenant capacity limits must be enabled for tenant-level capacity metrics to be exported through the /prometheusmetrics/tenants endpoint.
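For collectors written in Python rather than shell, a rough equivalent of the curl call can be built with the standard library. The sketch below only constructs the request; the helper name, host, and password are placeholders, and it assumes the endpoint accepts the same basic authentication that curl -u sends.

```python
import base64
import urllib.request


def build_metrics_request(vms_host: str, scope: str,
                          user: str, password: str) -> urllib.request.Request:
    """Build (without sending) a GET request for one prometheusmetrics scope."""
    url = f"https://{vms_host}/api/latest/prometheusmetrics/{scope}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


request = build_metrics_request("vast-file-server-vms", "tenants",
                                "admin", "example-password")
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a reachable VMS:
# import ssl
# context = ssl.create_default_context()  # verifies the certificate, unlike curl -k
# with urllib.request.urlopen(request, context=context, timeout=30) as response:
#     text = response.read().decode("utf-8")
```

Unlike the -k flag in the curl example, the commented-out call keeps certificate verification enabled, which is usually preferable outside of lab environments.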

Quota, User, and Group Capacity Monitoring (Prometheus)

The /prometheusmetrics/quotas endpoint exposes quota, per-user (UID), and per-group (GID) capacity metrics.

Example query:

curl -sku admin:******* \
"https://vast-file-server-vms/api/latest/prometheusmetrics/quotas"

Example metrics:

vast_quota_used_capacity              Quota Used Capacity
vast_user_quota_used_capacity         User Quota Used Capacity
vast_user_quota_percent_capacity      User Quota Capacity Percent Used
vast_group_quota_used_capacity        Group Quota Used Capacity
vast_group_quota_percent_capacity     Group Quota Capacity Percent Used

Note:

  • Quotas must be enabled for quota metrics to be exported, even if no hard or soft quota limits are configured.

  • User/group quota tracking can be enabled without enforcing actual capacity limits.
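To turn the exported text into per-user figures, the sample lines need to be parsed. The following is a simplified sketch of such a parser. The metric names come from the list above, but the label names (quota_path, uid) and all values are hypothetical, so check the actual labels on your cluster. Label values containing escaped quotes are not handled.

```python
import re
from typing import Dict, List, Tuple

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(exposition: str) -> List[Sample]:
    """Parse sample lines of the Prometheus text format, skipping comments."""
    samples: List[Sample] = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


text = """\
# HELP vast_user_quota_used_capacity User Quota Used Capacity
vast_user_quota_used_capacity{quota_path="/projects",uid="1001"} 2048
vast_user_quota_used_capacity{quota_path="/projects",uid="1002"} 4096
vast_quota_used_capacity{quota_path="/projects"} 6144
"""

# Keep only the per-user series and index them by the (hypothetical) uid label.
per_user = {
    labels["uid"]: value
    for name, labels, value in parse_samples(text)
    if name == "vast_user_quota_used_capacity"
}
print(per_user)
```

For production use, a maintained Prometheus client parser is likely a better choice than a hand-written regular expression.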

Alternative Capacity Monitoring Approaches

The following methods can also be used for tenant capacity monitoring, but they are less accurate than quota-based accounting and should be used primarily when quota tracking is unavailable.

Capacity via View API

Returns logical capacity per View:

curl -sku admin:****** \
"https://vast-file-server-vms/api/latest/views/?tenant_name=acme" | \
jq '.[] | {name, path, logical_capacity}'

Note: If the tenant has multiple Views or buckets, the values must be aggregated externally for tenant-level accounting.

The /prometheusmetrics/views endpoint also exposes View-level capacity metrics.

Example query:

curl -sku admin:****** \
"https://vast-file-server-vms/api/latest/prometheusmetrics/views"

Example metrics:

# HELP vast_view_logical_capacity View Logical Capacity
# HELP vast_view_physical_capacity View Physical Capacity

Capacity Estimation API

Estimates capacity usage for a specific filesystem path:

curl -sku admin:****** \
"https://vast-file-server-vms/api/latest/capacity/capacity_estimation?tenant_name=acme&path=/"

Note: capacity_estimation is path-based and requires an explicit filesystem path. It cannot estimate usage for an entire tenant without path aggregation.
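Because each call covers a single path, tenant-wide estimation means issuing one request per path. The sketch below only builds the request URLs in the form used by the curl call above. The helper name and every path other than "/" are hypothetical, and summing the results is omitted because the response format is not shown here.

```python
from typing import List
from urllib.parse import urlencode


def capacity_estimation_urls(vms_host: str, tenant_name: str,
                             paths: List[str]) -> List[str]:
    """Return one capacity_estimation URL per filesystem path."""
    base = f"https://{vms_host}/api/latest/capacity/capacity_estimation"
    urls = []
    for path in paths:
        # safe="/" keeps slashes readable; other special characters are encoded
        query = urlencode({"tenant_name": tenant_name, "path": path}, safe="/")
        urls.append(f"{base}?{query}")
    return urls


urls = capacity_estimation_urls("vast-file-server-vms", "acme",
                                ["/", "/projects", "/home/team a"])
for url in urls:
    print(url)
```

Encoding the query string avoids broken requests when a path contains spaces or other reserved characters.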

Grafana Dashboard Reference

Use or customize VAST’s official Grafana dashboards to visualize UID usage.

Client-Side Observability (NFS only)

VAST's vNFS Collector is an open-source tool that provides deep visibility into NFS workloads by capturing detailed I/O metrics for every NFS mount. It tracks per-operation counters for all key NFSv3 and NFSv4 commands, including READ, WRITE, LOOKUP, and DELETE, along with contextual metadata such as mount points, process names, user IDs, and environment variables like SLURM_JOB_ID. This rich dataset enables accurate workload profiling and performance tuning.

The collector supports flexible data forwarding, with local JSON logging and seamless integration into Prometheus (for Grafana dashboards), Kafka (for event-driven pipelines), and the VAST DataBase (for historical analytics via Trino, Spark, and Grafana).

VAST CSI Driver Prometheus Metrics

In Kubernetes environments, the VAST CSI Driver also supports exporting CSI node and controller metrics in Prometheus format, enabling observability for storage provisioning, mount operations, CSI RPC performance, and NFS transport health. These metrics can be integrated with Prometheus and Grafana to support operational monitoring and troubleshooting of containerized workloads running on VAST. For more details, see the CSI metrics guide: Exporting VAST CSI Driver Metrics to Prometheus

