Exporting Metrics to Prometheus


Overview

Prometheus is an open-source systems monitoring and alerting toolkit that provides a data model for describing and recording metrics over time, as well as a web application for displaying those metrics. Prometheus can be configured to fetch metrics from a third-party system by means of a software entity called a Prometheus exporter. A Prometheus exporter collects metrics from a third-party system, converts them to Prometheus metrics and exposes them via a resource path.

VAST Cluster provides a Prometheus exporter resource in the VMS REST API. The VAST Prometheus exporter fetches pre-defined metrics from the VMS database and returns them as a text/plain response in key-value format. You can configure the Prometheus server to scrape the metrics from the exporter at a chosen interval and display the VMS metrics through its display applications.
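As a rough illustration of the response format, the following Python sketch parses a few lines of Prometheus text exposition output into (name, labels, value) tuples. The metric names and label values in SAMPLE are hypothetical and are not actual VAST metric names. The parser is deliberately simplified: it assumes no timestamps and no escaped quotes inside label values.

```python
import re

# Hypothetical sample response; real VAST metric names will differ.
SAMPLE = """\
# HELP example_read_latency Example latency metric (microseconds)
# TYPE example_read_latency gauge
example_read_latency{cluster="c1",view="/data"} 250
example_vms_state 1
"""

# name, optional {labels}, then a single value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
```

In practice the Prometheus server does this parsing itself; a script like this is only useful for ad hoc inspection of an exporter response.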

Example Prometheus Graph of VAST Cluster Metric

NOTE: Metrics exported by the VAST Prometheus exporter have raw values. For example, latency metrics are in microseconds, bandwidth metrics are in bytes per second, and so on. These units are different from those used by default in the VAST Web UI.
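If you post-process scraped values yourself, the raw units may need converting. The sketch below converts microseconds to milliseconds and bytes per second to MiB per second. The target units are an assumption chosen for illustration; this page does not state which units the VAST Web UI displays.

```python
def microseconds_to_milliseconds(value_us):
    """Convert a raw latency value from microseconds to milliseconds."""
    return value_us / 1000.0


def bytes_per_sec_to_mib_per_sec(value_bps):
    """Convert a raw bandwidth value from bytes/s to MiB/s (1 MiB = 1024 * 1024 bytes)."""
    return value_bps / (1024 * 1024)


latency_ms = microseconds_to_milliseconds(2500)
bandwidth_mib = bytes_per_sec_to_mib_per_sec(3 * 1024 * 1024)
```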

How to Configure Prometheus to Collect VMS Metrics

For information about how to configure the Prometheus server to collect metrics from the VMS Prometheus exporter, read about Prometheus configuration.

The following are guidelines for providing some of the key parameters in the scrape_configs section of the Prometheus server configuration file:

  • metrics_path. This is the HTTP resource path from which to fetch metrics from VMS. Set it to one of the following:

    • /api/prometheusmetrics/alarms

      Exports all active VAST Cluster alarms.

    • /api/prometheusmetrics/basic_no_views

      Exports lightweight cluster metrics including node, box, physical/logical capacity, and performance data, but excludes all view-related metrics for faster collection.

      This endpoint is available from VAST Cluster 5.4.3.

    • /api/prometheusmetrics/defrag

      Exports metrics related to defragmentation.

    • /api/prometheusmetrics/devices

      Provides information about the SSD or NVRAM physical state, such as presence of media errors or current temperature, and overall operational status (active or failed).

    • /api/prometheusmetrics/host_view

      Performance metrics per client host per view.

      This endpoint is available from VAST Cluster 5.4.3.

    • /api/prometheusmetrics/kafka_targets

      Tracks external Kafka event streaming health per broker, monitoring published events, failures, errors, and send failures.

    • /api/prometheusmetrics/nics

      Tracks network interface card health with traffic stats, error counters (TX/RX errors, CRC errors, symbol errors), RDMA-specific metrics (duplicates, out-of-sequence, timeouts), flow control, and cable temperature.

    • /api/prometheusmetrics/quotas

      Provides information related to quotas configured on the cluster, such as the quota limits set and the number of users who have exceeded a quota or have been blocked for exceeding one.

    • /api/prometheusmetrics/replications

      Exposes replication stream status and health, tracking replication state, progress, and target availability.

    • /api/prometheusmetrics/sts

      Monitors STS integration metrics.

    • /api/prometheusmetrics/switches

      Exports network monitoring metrics that are collected from the cluster's switches, provided switch monitoring is enabled (see Ethernet Network Monitoring).

    • /api/prometheusmetrics/tenants

      Provides per-tenant performance metrics including IOPS, bandwidth, latency, metadata operations, QoS wait times, and capacity (logical capacity, DRR, snapshots).

    • /api/prometheusmetrics/users

      Exports user bandwidth, IOPS and metadata IOPS metrics on read and/or write operations.

    • /api/prometheusmetrics/user_connections

      Exports the 100 users with the highest number of active S3 connections, including only users with an attached QoS with a limit > 0.

    • /api/prometheusmetrics/user_view

      Performance metrics per user per view.

    • /api/prometheusmetrics/views

      Exports performance metrics per view, including bandwidth, IOPS, metadata IOPS, latency and QoS, and also view logical and physical capacity.

    • /api/prometheusmetrics/vips

      Monitors virtual IP and virtual IP pool performance with top-N bandwidth metrics (read/write IOPS, bandwidth) and virtual IP pool to CNode mappings.

    • /api/prometheusmetrics/vip_view

      Performance metrics per Virtual IP per view and Virtual IP pool per view.

      This endpoint is available from VAST Cluster 5.4.3.

    • /api/prometheusmetrics/vms_state

      Provides a critical health indicator gauge for VMS status, where 1=CLUSTERED (healthy) and 0=DEGRADED (unhealthy).

      This endpoint is available from VAST Cluster 5.4.3.

    • /api/prometheusmetrics/volumes

      Monitors volume performance with latency, I/O size distributions, and various aggregations for read/write/unmap/compare-and-write operations.

      This endpoint is available from VAST Cluster 5.4.3.

    • /api/prometheusmetrics/

      Exports cluster and CNode metrics that are not exported by the above-listed endpoints. This includes, for example, performance metrics per storage protocol, detailed information about the state of the hardware, and others.

    • /api/prometheusmetrics/all

      Exports all VAST Cluster metrics. This includes every metric that can be exported by the above-listed exporter endpoints. Because of the large amount of data exported, using this endpoint to collect metrics from a very large cluster is not recommended.

  • Under the static_configs section, where targets is set to <EXPORTER_HOST> in the snippet below, specify the cluster's VMS virtual IP in place of <EXPORTER_HOST>. This is the IP that you use to browse to the VAST Web UI. Set the port to 443 as shown in the snippet.

  • To authenticate to the VMS REST API using basic authentication, provide a VMS manager user name and password in the basic_auth section.

    Note

    A VMS manager user granted the minimum read-only role has sufficient permissions for calling the exporter endpoint. The read-only role is a built-in default role that you can assign to a manager user. For information about creating and modifying managers and roles, see Authorizing VMS Access and Permissions.

    Note

    When viewing a saved configuration on the Prometheus server, the password is hidden and displayed as a secret. For example:

    ...
    basic_auth:
      username: prometheus
      password: <secret>
    ...

    Note

    VMS REST API supports basic authentication and authentication over HTTPS secured by JSON Web Tokens (JWTs).

    For information about generating and using JWTs, see Authenticating to the VMS REST API in the VMS REST API documentation, which is available at https://<VMS_VIP>/docs/index.html from within your VMS management network (where <VMS_VIP> is your VMS virtual IP address for accessing the VAST Web UI).

  • Set the TLS configuration to verify or to skip client-side validation of the VMS SSL certificate as needed to ensure that an HTTPS connection with VMS will succeed. This will depend on your VMS TLS configuration, such as whether you have a CA-signed certificate installed in VMS (see Installing an SSL Certificate). See the Prometheus configuration instructions for the options available for specifying the TLS client configuration in the tls_config section. In the example shown in the snippet below, client-side validation of the certificate is skipped.

  • Setting a scrape timeout of 30 seconds should ensure a response for a larger scale system or load. Prometheus does not accept a scrape timeout that is greater than the scrape interval, so we recommend a scrape interval of one minute.
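The relationship between scrape interval and scrape timeout can be checked before you edit the configuration file. The following is a minimal sketch, assuming a simplified parser for Prometheus-style duration strings such as 30s, 1m or 1h30m. The function names are hypothetical, and Prometheus performs its own validation when it loads the configuration.

```python
import re

# Seconds per unit for Prometheus-style durations (a year is counted as 365 days).
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}
_PART_RE = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_to_seconds(text):
    """Convert a duration string such as '30s' or '1h30m' to seconds."""
    pos = 0
    total = 0.0
    while pos < len(text):
        part = _PART_RE.match(text, pos)
        if part is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(part.group(1)) * _UNIT_SECONDS[part.group(2)]
        pos = part.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def timeout_fits_interval(scrape_interval, scrape_timeout):
    """Return True if the timeout does not exceed the interval."""
    return duration_to_seconds(scrape_timeout) <= duration_to_seconds(scrape_interval)


recommended_ok = timeout_fits_interval("1m", "30s")
too_short = timeout_fits_interval("15s", "30s")
```

With the recommended values (interval 1m, timeout 30s) the check passes, whereas a 30s timeout combined with a 15s interval does not.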

The following is a snippet of a sample Prometheus server configuration file:

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# here it's the VAST Prometheus exporter in VMS.
scrape_configs:
  - job_name: 'vast'
    scheme: https
    scrape_interval: 1m # Overrides the global default for this job.
    scrape_timeout: 30s
    metrics_path: '/api/prometheusmetrics/users'
    static_configs:
      - targets: ['<EXPORTER_HOST>:443']
    basic_auth:
      username: '<USER_NAME>'
      password: '<PASSWORD>'
    tls_config:
      insecure_skip_verify: true # Skip validation of the VMS certificate.
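To test an exporter endpoint outside Prometheus, you could issue the same HTTPS request by hand. The Python sketch below builds a basic-authentication request for one of the endpoints listed above and, optionally, skips certificate validation in the same spirit as insecure_skip_verify. The host name vms.example.com, the credentials and the function names are hypothetical placeholders. Only the request is built here; fetch_metrics is defined but not called.

```python
import base64
import ssl
import urllib.request


def build_request(host, metrics_path, username, password):
    """Build an HTTPS request with a basic-authentication header."""
    url = f"https://{host}:443{metrics_path}"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_metrics(request, verify_tls=False, timeout=30):
    """Send the request and return the response body as text."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent in spirit to insecure_skip_verify: true.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")


# Hypothetical host and credentials, for illustration only.
request = build_request("vms.example.com", "/api/prometheusmetrics/users",
                        "prometheus", "secret")
```

Calling fetch_metrics(request) from a machine on the VMS management network would then return the raw metrics text, assuming the credentials belong to a manager user with at least the read-only role.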


Configuring Custom Metrics Labels per Tenant

From VAST Cluster 5.4.3, you can configure up to ten custom metrics labels and associate them with tenants, giving each label a value of your choice for each tenant. Prometheus metrics output then includes these labels and their tenant-specific values.

For example, suppose you have four tenants, A, B, C and D. A and B both serve customer X while C serves customer Y. D does not serve any customers. You could create a label customer on the cluster for labeling metrics by customer. After creating the label, you can assign it to tenants A, B and C and set its value for each tenant to the correct customer name. When Prometheus metrics are exported, metrics associated with these three tenants are shown with the label customer, set to the customer name that you configured for the relevant tenant. The label also appears for tenant D's metrics, with a null value, provided you do not set a default value for the label.
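The following sketch shows how such a label might be used downstream, by summing a metric per customer across the four tenants in the example. The metric name example_tenant_iops, the tenant label name and the sample values are hypothetical. Representing tenant D's null label value as an empty string is also an assumption, since this page does not show how a null value appears in the exported output.

```python
from collections import defaultdict

# Hypothetical parsed samples: (metric name, labels, value).
SAMPLES = [
    ("example_tenant_iops", {"tenant": "A", "customer": "X"}, 100.0),
    ("example_tenant_iops", {"tenant": "B", "customer": "X"}, 50.0),
    ("example_tenant_iops", {"tenant": "C", "customer": "Y"}, 70.0),
    # Tenant D has no customer; the empty string stands in for a null value.
    ("example_tenant_iops", {"tenant": "D", "customer": ""}, 5.0),
]


def sum_by_label(samples, metric_name, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric_name:
            totals[labels.get(label, "")] += value
    return dict(totals)


totals = sum_by_label(SAMPLES, "example_tenant_iops", "customer")
```

Tenants A and B are combined under customer X, which is the kind of grouping a per-tenant custom label makes possible.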

Metrics labels can be configured by a cluster admin user using the VAST CLI or the VMS REST API.

You can use the following VAST CLI commands to work with metrics labels:

Pre-Built Grafana Dashboards

Grafana dashboards are available for importing into your Grafana instance. These dashboards provide statistics and visualisations based on scraped metrics, as follows:

  • Main dashboard. Cluster health and statistics.

  • Space capacity. Space and quotas statistics.

  • CNodes. Performance and hardware statistics per CNode.

  • DNodes. Performance and hardware statistics per DNode, SCM and SSD.

  • Protocols metadata statistics. NFSv3, NFSv4 and S3 metadata latency statistics.

  • Views. Top views, and per-view performance statistics.

  • Users. Top users, and per-user performance statistics.

  • VIPs and VIP pools. Per-VIP and per-VIP-pool statistics.

  • Alarms. Active alarms per component.

To install and configure the Grafana dashboards, download them from our GitHub repo and import them into your Grafana instance.