Overview
The Ethernet network monitoring feature provides insight into network health, congestion, and protocol-specific behaviors on a cluster's Ethernet networks.
The feature uses the OpenTelemetry framework for collecting and exporting metrics, logs, and traces from Ethernet switches in the cluster.
Metrics can be displayed through predefined and custom VMS analytics reports. They can also be exported to a Prometheus endpoint.
OpenTelemetry can also trigger VMS events and alarms.
To enable network monitoring, you configure OpenTelemetry on supported switches.
Switches that are configured with OpenTelemetry are automatically discovered by VMS and listed in cluster switch listings. However, as with manually added switches, you need to provide VMS with switch credentials in order to fetch the switch and port properties.
Limitations
TLS encryption is not supported on the gRPC connection used to obtain the metrics from the switch.
Supported on NVIDIA Cumulus switches running version 5.12.1 or later. For other switch vendors, contact VAST Customer Success.
Supported on clusters with no more than 70 switches.
Which Metrics are Collected?
Switch ARP Packet Violations. The number of packets dropped due to exceeding the ARP packet policer limit.
Switch BGP Peer State. BGP connection state.
Switch Physical Layer Errors. Number of physical layer errors.
Switch RX Bandwidth. Receive bandwidth.
Switch TX Bandwidth. Transmit bandwidth.
Switch Buffer Usage Percentage for Traffic Class 0. Utilization (%) of switch transmit and receive buffers allocated to low-priority traffic.
Switch Buffer Usage Percentage for Traffic Class 3. Utilization (%) of switch transmit and receive buffers allocated to high-priority traffic.
Switch RX Buffer Discards for Traffic Class 0. Number of packets dropped within the switch due to lack of receive buffers for low-priority traffic.
Switch RX Buffer Discards for Traffic Class 3. Number of packets dropped within the switch due to lack of receive buffers for high-priority traffic.
Switch RX Discards for Traffic Class 3. Overall number of discarded receive packets for high-priority traffic.
Switch RX Paused Packets for Traffic Class 3. Number of paused receive packets within the switch for high-priority traffic.
Switch TX Paused Packets for Traffic Class 3. Number of paused transmit packets within the switch for high-priority traffic.
Configuring the Switches
Configure your switches to send OpenTelemetry data over gRPC to the VMS management IP address, using port 4317.
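Before configuring the switches, it can help to confirm that port 4317 on the VMS management IP is reachable from the switch management network. The following is a minimal sketch, not a VAST tool: the function name otlp_port_reachable is hypothetical, and the check only confirms that a TCP connection can be opened, not that OpenTelemetry data is accepted.

```python
import socket


def otlp_port_reachable(host: str, port: int = 4317, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This is only a reachability check. It does not speak gRPC or OTLP,
    so a True result does not prove that telemetry will be accepted.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling otlp_port_reachable("192.0.2.10") from a host on the same network as the switches returns True when the port is open. The address 192.0.2.10 is a placeholder; substitute your VMS management IP address.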
Switch Monitoring Analytics
The metrics are available as predefined analytics reports, and you can also define custom analytics reports. For information about viewing and defining analytics reports, see Analytics Reports. Select Switch as the object type.
All of the predefined switch analytics, except BGP peer state and buffer usage percentage, are monotonically increasing. To observe changes in these metrics, it is recommended to enable Intersampling mode on the analytics screen.
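To illustrate why a monotonically increasing metric is easier to read as changes between samples, the sketch below converts cumulative counter samples into per-interval increases. It illustrates the general idea only and is not a description of how VMS implements intersampling. The function name and the handling of counter resets are assumptions.

```python
from typing import List, Tuple

# (timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def per_interval_deltas(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn cumulative counter samples into per-interval increases.

    Returns one (timestamp, increase) pair for every sample after the first.
    """
    deltas = []
    for (_, prev_value), (ts, value) in zip(samples, samples[1:]):
        if value >= prev_value:
            increase = value - prev_value
        else:
            # Assumption: a drop means the counter was reset (for example,
            # after a switch reboot), so the new value is the increase.
            increase = value
        deltas.append((ts, increase))
    return deltas
```

With samples taken 30 seconds apart, a flat cumulative line becomes a run of zeros, and a burst of discards appears as a single non-zero interval.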
Exporting Analytics with Prometheus
The metrics can also be exported to Prometheus using the /api/prometheusmetrics/switches endpoint.
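As a rough sketch of consuming this endpoint outside of Prometheus, the following builds the endpoint URL and parses a response in the Prometheus text exposition format. The use of HTTPS, the function names, and the headers parameter (for whatever authentication your deployment requires) are assumptions. The names of the metrics returned by the endpoint are not documented here, so the sketch does not rely on any.

```python
import urllib.request
from typing import Dict, List, Optional, Tuple

ENDPOINT_PATH = "/api/prometheusmetrics/switches"  # path given in the text


def build_url(vms_host: str) -> str:
    # Assumption: VMS is reached over HTTPS at the given host name or IP.
    return f"https://{vms_host}{ENDPOINT_PATH}"


def fetch_metrics(vms_host: str, headers: Optional[Dict[str, str]] = None) -> str:
    """Fetch the raw exposition text. Authentication headers are hypothetical."""
    request = urllib.request.Request(build_url(vms_host), headers=headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def parse_exposition(text: str) -> List[Tuple[str, float]]:
    """Parse exposition text into (series, value) pairs.

    The series string keeps the metric name and its labels together.
    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Split after the closing brace so spaces inside label values are safe.
            idx = line.rfind("}")
            series, rest = line[: idx + 1], line[idx + 1:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        # rest[0] is the value; an optional timestamp may follow and is ignored.
        results.append((series, float(rest[0])))
    return results
```

In a typical setup, Prometheus scrapes the endpoint directly. A script like this is mainly useful for a quick check of what the endpoint returns.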
Alarms
Alarms are raised for each switch metric according to modifiable event definitions. Metrics are sampled every 30 seconds. For BGP peer state and buffer usage percentage, the events are threshold type events (see Event Types). For the other metrics, the events are rate type events. By default, the rate type switch events monitor changes in the metrics over 10-minute periods. This means that the underlying condition could have occurred at any time within the 10-minute period before the alarm was raised.
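The relationship between the 30-second sampling interval and the 10-minute monitoring period can be sketched as follows. This is an illustration under stated assumptions, not the VMS event engine: the function name, the 600-second default, and the idea of comparing the increase against a threshold are hypothetical stand-ins for the configurable event definition.

```python
from typing import List, Tuple

# (timestamp in seconds, cumulative counter value), sorted by timestamp
Sample = Tuple[float, float]


def increase_over_window(samples: List[Sample], window_seconds: float = 600.0) -> float:
    """Increase of a cumulative counter across the trailing window.

    The window ends at the latest sample and reaches back window_seconds.
    """
    if not samples:
        return 0.0
    latest_ts, latest_value = samples[-1]
    # Keep only the samples that fall inside the trailing window.
    in_window = [value for ts, value in samples if latest_ts - ts <= window_seconds]
    return latest_value - in_window[0]


# Samples every 30 seconds for 15 minutes; the counter jumps once at t=480.
samples = [(float(t), 10.0 if t < 480 else 15.0) for t in range(0, 901, 30)]
print(increase_over_window(samples))
```

In this example the check at t=900 still reports an increase, even though the counter changed at t=480, seven minutes earlier. This is why an alarm can reflect a condition from any point in the preceding 10 minutes.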
To find and adjust the relevant event definitions, go to the Alarms and Events page, select the Event Definitions tab, and filter the event definitions by object type Switch.
For more information about alarms and events, see Alarms and Events.