VAST CSI Driver can be configured to expose CSI node and controller metrics in Prometheus format. The node metrics include total counts and average durations for CSI RPCs and mounts/umounts, and also NFS transport (xprt) statistics. The controller metrics include total counts and average durations for CSI RPCs.
Enabling Export of CSI Metrics
By default, the driver does not expose any metrics.
To enable export of metrics:
Add the following to the driver's Helm chart configuration file:
node:
  metrics:
    enabled: true
    port: 9090
controller:
  metrics:
    enabled: true
    port: 9091
Exposed CSI Metrics Endpoints and Ports
When metrics export is enabled:
A headless service is created that serves metrics requests at two endpoints:
GET /metrics for getting the metrics in Prometheus format (counters, histograms, gauges),
GET /health for health checks.
The node's DaemonSet pods expose the node metrics on port 9090.
The controller's Deployment/StatefulSet pods expose the controller metrics on port 9091.
NOTE: You can override default ports by specifying a different value in the port entry under node or controller metrics in the driver's Helm chart configuration file.
Exported CSI Metrics
NOTE: For a complete reference on CSI metrics, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_REFERENCE.md.
CSI Node Metrics
Mounts/umounts
csi_node_mount_operations_total: Total number of mounts (of a PVC to a pod)
csi_node_mount_duration_seconds: Duration of mounts (in seconds)
csi_node_umount_operations_total: Total number of umounts
csi_node_umount_duration_seconds: Duration of umounts (in seconds)
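The duration metrics can be turned into an average by dividing the accumulated time by the number of observations. The following is a minimal sketch, assuming the duration metric is exposed with the _sum and _count series that appear in the sample values later on this page. The function name average_duration and the flat dictionary of samples are illustrative only and are not part of the driver; labels are ignored for brevity.

```python
from __future__ import annotations


def average_duration(samples: dict[str, float], metric: str) -> float | None:
    """Return the mean duration in seconds for `metric`.

    `samples` maps series names to values (labels are ignored in this sketch).
    Returns None when the series are missing or no operation was recorded.
    """
    total = samples.get(f"{metric}_sum")
    count = samples.get(f"{metric}_count")
    if total is None or not count:
        return None
    return total / count


# Values taken from the sample output for a single mounted PVC.
samples = {
    "csi_node_mount_duration_seconds_sum": 0.823,
    "csi_node_mount_duration_seconds_count": 1.0,
}
print(average_duration(samples, "csi_node_mount_duration_seconds"))  # 0.823
```

Returning None instead of 0 keeps "no mounts recorded yet" distinguishable from "mounts were instantaneous".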
NFS transport (xprt) status per cluster
csi_node_nfs_xprt_total: Total number of active NFS transports
csi_node_nfs_xprt_connected: Number of connected NFS transports
csi_node_nfs_xprt_pending_requests_total: Total number of pending requests across all mounts
csi_node_nfs_xprt_backlog_total: Total number of backlog requests across all mounts
csi_node_nfs_xprt_unhealthy: Number of unhealthy NFS transports
NFS transport (xprt) status per virtual IP
These metrics can only be exported while the virtual IP is connected.
csi_node_nfs_xprt_connected_state: 1.0 indicates a healthy connection. 0.0 means disconnected.
csi_node_nfs_xprt_congested_state: 1.0 means flow control is active.
csi_node_nfs_xprt_locked_state: 1.0 indicates that the connection is locked.
csi_node_nfs_xprt_pending_requests: Number of RPC calls waiting for a response
csi_node_nfs_xprt_backlog_depth: Backlog queue depth for this virtual IP
csi_node_nfs_xprt_mounts: Number of active NFS mounts for this virtual IP
CSI Controller Metrics
Total number of all CSI gRPC method calls (CreateVolume, DeleteVolume, ControllerPublishVolume, and so on)
Average duration of a CSI gRPC method call
Metrics Details
In addition to the measured value per metric type, a metric may include labels that provide additional information about the measured operation. For example, the following metric:
csi_node_mount_operations_total{operation_type="nfs",status="success",node_name="worker-node-1",pvc_namespace="prod"} 15
specifies that the measured value was taken for the mounts that:
were made through the NFS access protocol,
completed successfully,
occurred on a worker node named worker-node-1,
targeted the prod namespace.
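To make the relationship between metric name, labels and value concrete, the following is a simplified sketch that splits one sample line into its three parts. It is not a full Prometheus text-format parser: it assumes a single sample per line, no trailing timestamp, and label values without unescaping. The names parse_sample, SAMPLE_RE and LABEL_RE are hypothetical helpers introduced for this illustration.

```python
import re

# metric_name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# key="value" pairs inside the braces (escaped characters are kept as-is)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


line = (
    'csi_node_mount_operations_total{operation_type="nfs",status="success",'
    'node_name="worker-node-1",pvc_namespace="prod"} 15'
)
name, labels, value = parse_sample(line)
print(name, labels["node_name"], labels["pvc_namespace"], value)
```

For production use, an existing Prometheus client library parser is likely a better choice than a hand-written regular expression.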
Accessing Exported CSI Metrics
NOTE: For more detailed guidance, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_GUIDE.md.
Run the following commands to verify that the metrics endpoints work as expected:
For node metrics:
kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-node
kubectl port-forward -n vast-csi pod/<CSI node pod name> 9090:9090
curl -s http://localhost:9090/metrics
curl -s http://localhost:9090/health
For controller metrics:
kubectl get pods -n vast-csi -l app.kubernetes.io/component=csi-controller
kubectl port-forward -n vast-csi pod/<CSI controller pod name> 9091:9091
curl -s http://localhost:9091/metrics
curl -s http://localhost:9091/health
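The same verification can be scripted. The following is a rough sketch that requests both endpoints and lists the metric names found, assuming a port-forward to the pod is already running in another terminal. It also assumes that a healthy /health endpoint answers with HTTP 200, which this page does not state explicitly. The functions fetch and check are hypothetical helpers, and the opener parameter exists only so the logic can be exercised without a live cluster.

```python
from urllib.request import urlopen


def fetch(base_url, path, opener=urlopen, timeout=5.0):
    """GET base_url + path and return (HTTP status, body text)."""
    with opener(f"{base_url}{path}", timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def check(base_url, opener=urlopen):
    """Return (health_ok, sorted metric names) for one metrics port."""
    health_status, _ = fetch(base_url, "/health", opener)
    _, metrics_text = fetch(base_url, "/metrics", opener)
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or at the first space.
        names.add(line.split("{")[0].split()[0])
    return health_status == 200, sorted(names)


# Example (requires an active port-forward to a CSI node pod):
# ok, names = check("http://localhost:9090")
# print(ok, names)
```

Note that urlopen raises urllib.error.HTTPError for error responses, so a failing health check surfaces as an exception rather than as a False result.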
Sample Metrics Values for Common Scenarios
The following illustrates typical metrics values in common scenarios.
NOTE: For more examples, see https://github.com/vast-data/vast-csi/blob/v2.6/docs/METRICS_EXAMPLES.md.
Upon creating a pod:
Aggregate metrics are set to 0.
Counters are not reported.
csi_node_nfs_xprt_connected_state, csi_node_nfs_xprt_congested_state and csi_node_nfs_xprt_pending_requests are not reported.
After mounting one PVC to virtual IP 192.168.1.10:
csi_node_mount_operations_total{node_name="worker-1",operation_type="nfs",pvc_namespace="default",status="success"} 1.0
csi_node_mount_duration_seconds_sum{...} 0.823
csi_node_mount_duration_seconds_count{...} 1.0
csi_node_nfs_xprt_total 1.0
csi_node_nfs_xprt_connected 1.0
csi_node_nfs_xprt_connected_state{destination="192.168.1.10"} 1.0
If the VAST cluster's virtual IP becomes unreachable:
csi_node_mount_duration_seconds is at approx. 30 seconds (or your configured timeout).
csi_node_nfs_xprt_total is 0.0.
In case of network congestion/high latency:
csi_node_nfs_xprt_congested_state is 1.0.
csi_node_nfs_xprt_pending_requests exceeds 100.
csi_node_nfs_xprt_unhealthy is set to a non-zero value.
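The three congestion indicators listed above could be checked programmatically, for example in a monitoring script. The following is a minimal sketch under simplifying assumptions: the samples are given as a flat dictionary of metric name to value, so the per-virtual-IP labels are ignored, and the threshold of 100 pending requests is taken from the scenario described here rather than from any driver setting. The function congestion_signals is a hypothetical helper.

```python
def congestion_signals(samples):
    """Return the congestion indicators that are present in `samples`.

    `samples` maps metric names to float values; missing metrics are
    treated as "not reported" and do not raise a signal.
    """
    signals = []
    if samples.get("csi_node_nfs_xprt_congested_state", 0.0) == 1.0:
        signals.append("flow control is active")
    if samples.get("csi_node_nfs_xprt_pending_requests", 0.0) > 100:
        signals.append("more than 100 pending requests")
    if samples.get("csi_node_nfs_xprt_unhealthy", 0.0) != 0:
        signals.append("unhealthy transports reported")
    return signals


congested = {
    "csi_node_nfs_xprt_congested_state": 1.0,
    "csi_node_nfs_xprt_pending_requests": 140.0,
    "csi_node_nfs_xprt_unhealthy": 1.0,
}
print(congestion_signals(congested))
```

An empty list means none of the indicators fired for the values provided.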
On pod deletion:
csi_node_mount_operations_total stays at its last value.
csi_node_umount_operations_total is reported with a non-zero value.
After approx. 30 seconds, csi_node_nfs_xprt_total drops to 0.0.