Exporting VAST CSI Driver Metrics to Prometheus

The VAST CSI Driver can be configured to expose CSI node and controller metrics in Prometheus format. Node metrics include total counts and average durations for CSI RPCs and for mounts/umounts, as well as NFS transport (xprt) statistics. Controller metrics include total counts and average durations for CSI RPCs.

Enabling Export of CSI Metrics

By default, the driver does not expose any metrics.

To enable export of metrics:

Add the following to the driver's Helm chart configuration file:

node:
  metrics:
    enabled: true
    port: 9090
controller:
  metrics:
    enabled: true
    port: 9091

Install or upgrade the driver's Helm chart.

Exposed CSI Metrics Endpoints and Ports

When metrics export is enabled:

- A headless service is created that serves metrics requests at two endpoints:
  - GET /metrics returns the metrics in Prometheus format (counters, histograms, gauges).
  - GET /health serves health checks.
- The node's DaemonSet pods expose the node metrics port 9090.
- The controller's Deployment/StatefulSet pods expose the controller metrics port 9091.

NOTE: You can override the default ports by specifying a different value in the port entry under the node or controller metrics section in the driver's Helm chart configuration file.

Exported CSI Metrics

NOTE: For a complete reference on CSI metrics, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_REFERENCE.md.

CSI Node Metrics

Mounts/umounts

`csi_node_mount_operations_total`	Total number of mounts (of a PVC to a pod)
`csi_node_mount_duration_seconds`	Duration of mounts (in seconds)
`csi_node_umount_operations_total`	Total number of umounts
`csi_node_umount_duration_seconds`	Duration of umounts (in seconds)

NFS transport (xprt) status per cluster

`csi_node_nfs_xprt_total`	Total number of active NFS transports
`csi_node_nfs_xprt_connected`	Number of connected NFS transports
`csi_node_nfs_xprt_pending_requests_total`	Total number of pending requests across all mounts
`csi_node_nfs_xprt_backlog_total`	Total number of backlog requests across all mounts
`csi_node_nfs_xprt_unhealthy`	Number of unhealthy NFS transports

NFS transport (xprt) status per virtual IP

These metrics are exported only while the virtual IP is connected.

`csi_node_nfs_xprt_connected_state`	`1.0` indicates a healthy connection. `0.0` means disconnected.
`csi_node_nfs_xprt_congested_state`	`1.0` means flow control is active.
`csi_node_nfs_xprt_locked_state`	`1.0` indicates that the connection is locked.
`csi_node_nfs_xprt_pending_requests`	Number of RPC calls waiting for a response
`csi_node_nfs_xprt_backlog_depth`	Backlog queue depth for this virtual IP
`csi_node_nfs_xprt_mounts`	Number of active NFS mounts for this virtual IP

CSI Controller Metrics

`csi_plugin_operations_total`	Total number of all CSI gRPC method calls (CreateVolume, DeleteVolume, ControllerPublishVolume, and so on)
`csi_plugin_operations_seconds`	Average duration of a CSI gRPC method call

Metrics Details

In addition to the measured value per metric type, a metric may include labels that provide additional information about the measured operation. For example, the following metric:

csi_node_mount_operations_total{operation_type="nfs",status="success",node_name="worker-node-1",pvc_namespace="prod"} 15

indicates that the measured value (15) covers the mounts that:

- were made through the NFS access protocol (operation_type),
- completed successfully (status),
- occurred on the worker node named worker-node-1 (node_name),
- targeted a PVC in the prod namespace (pvc_namespace).
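
To make the name/labels/value structure concrete, the following is a minimal Python sketch that splits one line of Prometheus text output into its parts. The function name parse_sample is hypothetical and not part of the driver. The sketch ignores optional timestamps and does not unescape label values, so it is only an illustration and not a full parser of the exposition format.

```python
import re

# Matches: metric_name{label="value",...} value
# Assumption: no trailing timestamp and no "#" comment lines.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# Matches one label="value" pair inside the braces.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric name, labels dict, float value) for one sample line."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

Applied to the sample line above, the sketch yields the metric name csi_node_mount_operations_total, a dictionary with the four labels, and the value 15.0.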

Accessing Exported CSI Metrics

NOTE: For more detailed guidance, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_GUIDE.md.

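Run the following commands to verify that the metrics endpoints work as expected. The kubectl port-forward command keeps running in the foreground, so run the curl commands from a second terminal: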

For node metrics:

kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-node
kubectl port-forward -n vast-csi pod/<CSI node pod name> 9090:9090
curl -s http://localhost:9090/metrics
curl -s http://localhost:9090/health

For controller metrics:

kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-controller
kubectl port-forward -n vast-csi pod/<CSI controller pod name> 9091:9091
curl -s http://localhost:9091/metrics
curl -s http://localhost:9091/health
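
The same check can be scripted. The following Python sketch assumes that a port-forward like the ones above is already running and that the endpoints are reachable on localhost. The helper names (endpoint_urls, fetch, metric_names) are hypothetical and are not part of the driver or of kubectl.

```python
import re
import urllib.request


def endpoint_urls(port, host="localhost"):
    """Build the /metrics and /health URLs for a forwarded port."""
    base = f"http://{host}:{port}"
    return {"metrics": f"{base}/metrics", "health": f"{base}/health"}


def fetch(url, timeout=5.0):
    """Issue a GET request and return (HTTP status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def metric_names(exposition_text):
    """Collect the distinct metric names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # The metric name ends at the first "{" or whitespace character.
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names
```

For example, with the node port-forward active, calling fetch(endpoint_urls(9090)["metrics"]) returns the status code and the metrics text, and passing that text to metric_names shows which csi_node_* metrics the pod currently reports.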

Sample Metrics Values for Common Scenarios

The following illustrates typical metrics values in common scenarios.

NOTE: For more examples, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_EXAMPLES.md.

Upon creating a pod:
- Aggregate metrics are set to 0.
- Counters are not reported.
- csi_node_nfs_xprt_connected_state, csi_node_nfs_xprt_congested_state and csi_node_nfs_xprt_pending_requests are not reported.

After mounting one PVC to virtual IP 192.168.1.10:

csi_node_mount_operations_total{node_name="worker-1",operation_type="nfs",pvc_namespace="default",status="success"} 1.0
csi_node_mount_duration_seconds_sum{...} 0.823
csi_node_mount_duration_seconds_count{...} 1.0
csi_node_nfs_xprt_total 1.0
csi_node_nfs_xprt_connected 1.0
csi_node_nfs_xprt_connected_state{destination="192.168.1.10"} 1.0
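
The _sum and _count series shown above can be combined into an average duration. The following Python sketch illustrates the arithmetic. The function names and the (sum, count) tuple shape are assumptions made for this illustration and are not part of the driver.

```python
def average_duration(duration_sum, duration_count):
    """Average seconds per operation since the metric was first reported.

    Returns None when no operation has been recorded yet.
    """
    if duration_count == 0:
        return None
    return duration_sum / duration_count


def average_between(previous, current):
    """Average seconds per operation between two scrapes.

    Each argument is a (sum, count) tuple read from the *_sum and *_count
    series of the same metric with the same labels.
    """
    delta_sum = current[0] - previous[0]
    delta_count = current[1] - previous[1]
    if delta_count <= 0:
        return None  # no new operations, or the counters were reset
    return delta_sum / delta_count
```

With the sample values above, average_duration(0.823, 1.0) gives 0.823 seconds for the single mount. Comparing two scrapes with average_between limits the average to operations that completed in between, which is usually more informative on a long-running node.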

If the VAST cluster's virtual IP becomes unreachable:
- csi_node_mount_duration_seconds is at approx. 30 seconds (or your configured timeout).
- csi_node_nfs_xprt_total is 0.0.

In case of network congestion/high latency:
- csi_node_nfs_xprt_congested_state is 1.0.
- csi_node_nfs_xprt_pending_requests exceeds 100.
- csi_node_nfs_xprt_unhealthy is set to a non-zero value.

On pod deletion:
- csi_node_mount_operations_total stays at its last value.
- csi_node_umount_operations_total is reported with a non-zero value.
- After approx. 30 seconds, csi_node_nfs_xprt_total drops to 0.0.
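
As an illustration of how the congestion scenario could be detected from scraped values, the following Python sketch checks the three conditions listed for network congestion/high latency. The function name, the input dictionary shape, and the use of 100 as the pending-requests threshold are assumptions taken from the example values in this section. They are not an official alerting rule.

```python
def congestion_signals(samples, pending_threshold=100):
    """Return the names of metrics whose values match the congestion scenario.

    samples maps a metric name to its latest value for one node or virtual IP.
    Metrics that are missing from the mapping are treated as not reported.
    """
    signals = []
    if samples.get("csi_node_nfs_xprt_congested_state") == 1.0:
        signals.append("csi_node_nfs_xprt_congested_state")
    if samples.get("csi_node_nfs_xprt_pending_requests", 0) > pending_threshold:
        signals.append("csi_node_nfs_xprt_pending_requests")
    if samples.get("csi_node_nfs_xprt_unhealthy", 0) > 0:
        signals.append("csi_node_nfs_xprt_unhealthy")
    return signals
```

An empty result means that none of the three conditions hold. Any returned name points at the metric to inspect first.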