Exporting VAST CSI Driver Metrics to Prometheus


VAST CSI Driver can be configured to expose CSI node and controller metrics in Prometheus format. The node metrics include total counts and average durations for CSI RPCs and mounts/umounts, and also NFS transport (xprt) statistics. The controller metrics include total counts and average durations for CSI RPCs.

Enabling Export of CSI Metrics

By default, the driver does not expose any metrics.

To enable export of metrics:

  1. Add the following to the driver's Helm chart configuration file:

    node:
      metrics:
        enabled: true
        port: 9090
    controller:
      metrics:
        enabled: true
        port: 9091
  2. Install or upgrade the driver's Helm chart.

Exposed CSI Metrics Endpoints and Ports

When metrics export is enabled:

  • A headless service is created that serves metrics requests at two endpoints:

    • GET /metrics for getting the metrics in Prometheus format (counters, histograms, gauges),

    • GET /health for health checks.

  • The node's DaemonSet pods expose node metrics on port 9090.

  • The controller's Deployment/StatefulSet pods expose controller metrics on port 9091.
    NOTE: To override a default port, set a different value in the port entry under node.metrics or controller.metrics in the driver's Helm chart configuration file.

Exported CSI Metrics

NOTE: For a complete reference on CSI metrics, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_REFERENCE.md.

CSI Node Metrics

  • Mounts/umounts

    csi_node_mount_operations_total

    Total number of mounts (of a PVC to a pod)

    csi_node_mount_duration_seconds

    Duration of mounts (in seconds)

    csi_node_umount_operations_total

    Total number of umounts

    csi_node_umount_duration_seconds

    Duration of umounts (in seconds)

  • NFS transport (xprt) status per cluster

    csi_node_nfs_xprt_total

    Total number of active NFS transports

    csi_node_nfs_xprt_connected

    Number of connected NFS transports

    csi_node_nfs_xprt_pending_requests_total

    Total number of pending requests across all mounts

    csi_node_nfs_xprt_backlog_total

    Total number of backlog requests across all mounts

    csi_node_nfs_xprt_unhealthy

    Number of unhealthy NFS transports

  • NFS transport (xprt) status per virtual IP
    These metrics can only be exported while the virtual IP is connected.

    csi_node_nfs_xprt_connected_state

    1.0 indicates a healthy connection. 0.0 indicates that the transport is disconnected.

    csi_node_nfs_xprt_congested_state

    1.0 means flow control is active.

    csi_node_nfs_xprt_locked_state

    1.0 indicates that the connection is locked.

    csi_node_nfs_xprt_pending_requests

    Number of RPC calls waiting for a response

    csi_node_nfs_xprt_backlog_depth

    Backlog queue depth for this virtual IP

    csi_node_nfs_xprt_mounts

    Number of active NFS mounts for this virtual IP

CSI Controller Metrics

csi_plugin_operations_total

Total number of all CSI gRPC method calls (CreateVolume, DeleteVolume, ControllerPublishVolume, and so on)

csi_plugin_operations_seconds

Average duration of a CSI gRPC method call
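
A duration metric is typically turned into an average by dividing its _sum series by its _count series. The sample output later on this page shows csi_node_mount_duration_seconds exposed this way; that csi_plugin_operations_seconds exposes the same two series is an assumption here, not something this page confirms. The following is a minimal Python sketch of the calculation, with a hypothetical helper name:

```python
from typing import Dict, Optional


def average_duration(samples: Dict[str, float], metric: str) -> Optional[float]:
    """Return <metric>_sum / <metric>_count, or None if nothing was observed.

    Assumes the metric is exposed with _sum and _count series.
    """
    total = samples.get(f"{metric}_sum")
    count = samples.get(f"{metric}_count")
    if total is None or not count:
        # Missing series, or zero observations: no meaningful average.
        return None
    return total / count


# Values copied from the single-mount sample shown later on this page.
samples = {
    "csi_node_mount_duration_seconds_sum": 0.823,
    "csi_node_mount_duration_seconds_count": 1.0,
}
print(average_duration(samples, "csi_node_mount_duration_seconds"))
```

With a single observation, the average equals the duration of that one mount.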

Metrics Details

In addition to the measured value per metric type, a metric may include labels that provide additional information about the measured operation. For example, the following metric:

csi_node_mount_operations_total{operation_type="nfs",status="success",node_name="worker-node-1",pvc_namespace="prod"} 15

reports a value of 15 for the mounts that:

  • were made through the NFS access protocol,

  • completed successfully,

  • occurred on a worker node named worker-node-1,

  • targeted the prod namespace.
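
To make the name, label, and value structure of such a line concrete, the sketch below splits one exposition line into its three parts. It is a simplified illustration, not a full parser for the Prometheus text format: it ignores timestamps, comment lines, and escape sequences inside label values, and parse_sample is a hypothetical helper name.

```python
import re

# <name>{<labels>} <value>, where the {labels} part is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# A single key="value" pair inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one metric line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


line = (
    'csi_node_mount_operations_total{operation_type="nfs",status="success",'
    'node_name="worker-node-1",pvc_namespace="prod"} 15'
)
name, labels, value = parse_sample(line)
print(name)
print(labels)
print(value)
```

For production use, an established parser such as the one in the prometheus_client Python package is likely a better choice than hand-written regular expressions.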

Accessing Exported CSI Metrics

NOTE: For more detailed guidance, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_GUIDE.md.

Run the following commands to verify that the metrics endpoints work as expected:

  • For node metrics:

    kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-node
    kubectl port-forward -n vast-csi pod/<CSI node pod name> 9090:9090
    curl -s http://localhost:9090/metrics
    curl -s http://localhost:9090/health
  • For controller metrics:

    kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-controller
    kubectl port-forward -n vast-csi pod/<CSI controller pod name> 9091:9091
    curl -s http://localhost:9091/metrics
    curl -s http://localhost:9091/health
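
The same checks can be scripted. The Python sketch below assumes that one of the kubectl port-forward commands above is still running, so that the endpoint is reachable on localhost. The helper names fetch and metric_names are hypothetical, and the format of the /health response body is not assumed.

```python
import urllib.request


def fetch(url, timeout=5.0):
    """GET a URL and return (HTTP status code, response body as text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def metric_names(exposition_text):
    """Return the sorted, distinct metric names found in /metrics output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # The metric name ends at the first '{' or at the first whitespace.
        names.add(line.split("{", 1)[0].split()[0])
    return sorted(names)


# Example usage while port-forwarding a CSI node pod (port 9090):
#   status, body = fetch("http://localhost:9090/metrics")
#   print(status, metric_names(body))
#   print(fetch("http://localhost:9090/health"))
```

Listing the distinct metric names is a quick way to confirm that the expected metric families are present before configuring a Prometheus scrape job.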

Sample Metrics Values for Common Scenarios

The following illustrates typical metrics values in common scenarios.

NOTE: For more examples, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_EXAMPLES.md.

  • Upon creating a pod:

    • Aggregate metrics are set to 0.

    • Counters are not reported.

    • csi_node_nfs_xprt_connected_state, csi_node_nfs_xprt_congested_state and csi_node_nfs_xprt_pending_requests are not reported.

  • After mounting one PVC to virtual IP 192.168.1.10:

    csi_node_mount_operations_total{node_name="worker-1",operation_type="nfs",pvc_namespace="default",status="success"} 1.0
    csi_node_mount_duration_seconds_sum{...} 0.823
    csi_node_mount_duration_seconds_count{...} 1.0
    csi_node_nfs_xprt_total 1.0
    csi_node_nfs_xprt_connected 1.0
    csi_node_nfs_xprt_connected_state{destination="192.168.1.10"} 1.0
  • If the VAST cluster's virtual IP becomes unreachable:

    • csi_node_mount_duration_seconds reaches approximately 30 seconds (or your configured timeout).

    • csi_node_nfs_xprt_total is 0.0.

  • In case of network congestion/high latency:

    • csi_node_nfs_xprt_congested_state is 1.0.

    • csi_node_nfs_xprt_pending_requests exceeds 100.

    • csi_node_nfs_xprt_unhealthy is set to a non-zero value.

  • On pod deletion:

    • csi_node_mount_operations_total stays at its last value.

    • csi_node_umount_operations_total is reported with a non-zero value.

    • After approx. 30 seconds, csi_node_nfs_xprt_total drops to 0.0.
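
The scenarios above can be turned into simple checks. The Python sketch below is illustrative only: the diagnose function is hypothetical, it takes already-collected values keyed by metric name (labels are ignored), and the threshold of 100 pending requests is taken from the illustrative values on this page rather than from a documented limit.

```python
def diagnose(samples, pending_threshold=100):
    """Map a dict of metric name -> value to a list of human-readable findings."""
    findings = []
    if samples.get("csi_node_nfs_xprt_total") == 0.0:
        # Seen when the virtual IP is unreachable, and also after pod deletion.
        findings.append("no active NFS transports")
    if samples.get("csi_node_nfs_xprt_congested_state") == 1.0:
        findings.append("flow control is active (congestion)")
    if samples.get("csi_node_nfs_xprt_pending_requests", 0) > pending_threshold:
        findings.append("high number of pending RPC requests")
    if samples.get("csi_node_nfs_xprt_unhealthy", 0) > 0:
        findings.append("unhealthy NFS transports reported")
    return findings


# Values resembling the "one PVC mounted" scenario above.
healthy = {"csi_node_nfs_xprt_total": 1.0, "csi_node_nfs_xprt_connected": 1.0}
# Values resembling the congestion/high latency scenario above.
congested = {
    "csi_node_nfs_xprt_total": 1.0,
    "csi_node_nfs_xprt_congested_state": 1.0,
    "csi_node_nfs_xprt_pending_requests": 150.0,
    "csi_node_nfs_xprt_unhealthy": 1.0,
}
print(diagnose(healthy))
print(diagnose(congested))
```

Because csi_node_nfs_xprt_total also drops to 0.0 after the last pod is deleted, a zero value on its own does not distinguish an outage from an idle node; in practice it would be combined with the mount and umount counters.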