Capacity Analysis with vastpy-cli

Summary

This document is a guide to analyzing storage capacity and file data on a VAST Data platform using vastpy-cli. We cover both direct capacity queries and more advanced catalog-based analysis.

Methods

Via VAST REST API

We will cover two methods of obtaining this data. The first uses the VAST Capacity Estimations data, leveraging the VAST REST API (via the vastpy-cli utility) to query capacity statistics for any directory on the VAST platform. This API is available on all VAST versions from 4.0 onwards and requires no configuration. For more information on this data, see Capacity Overview.

This approach is useful if:

  1. The VAST Catalog is not enabled on your cluster

  2. You want insight into the data-reduction ratio of different folders on the system.

Via VAST Catalog

The second approach is to use vastpy-cli to query the VAST Catalog, a feature that has been available since the 4.6 release. This does require that the VAST Catalog is enabled; see VAST Catalog Overview for more details on the feature and how to configure it. This approach is preferred if:

  1. The VAST Catalog is enabled

  2. You want more granular insight and filtering capabilities.  Examples include:

    • Determine capacity used per user / group

    • Determine capacity per file extension

    • Find individual file locations based on a filename search

    • Much, much more.

Regardless of which of these approaches you choose, you will be using the same basic tool (vastpy-cli).  There are other methods of accessing this data, which will be covered in another document.

Prerequisites

A host with:

  • Python installed

  • Network access to the VMS IP address, specifically on port 443 (HTTPS)

  • VMS Manager account with appropriate permissions

Setting Up vastpy-cli

Installing vastpy

Install the vastpy package, which includes the vastpy-cli command:

pip install --upgrade vastpy

You may also want to install the jq utility:

sudo apt-get install -y jq #ubuntu/debian variants
sudo yum install -y jq #rocky/rhel variants

Setting Up Authentication

Set up environment variables for authentication (recommended method):

# Set these environment variables once per session
export VMS_USER=admin
 export VMS_PASSWORD=your_password  # The leading space keeps the password out of shell history (in shells configured to ignore space-prefixed commands)
export VMS_ADDRESS=vast-vms-address

Alternatively, you can specify credentials with each command:

vastpy-cli --user=admin --password=your_password --address=vast-vms-address [command]

1. Analyzing Directory Capacity with vastpy-cli

The capacity subcommand in vastpy-cli allows you to retrieve detailed capacity information for specific directories on your VAST Data platform.

Generally, filtering is done by specifying the path you wish to analyze. It can be a top-level directory or any subdirectory within the system. Note that the data behind this API is an estimation based on sampling and should not be used for exact calculations. There may also be a delay between the time data is ingested into a folder and the time it appears in the statistics, because sampling can occur as data is migrated asynchronously from SCM to QLC flash. Lastly, folders that do not have enough aggregate capacity will not appear on their own in these metrics. If you are unsure about the information you are seeing, contact VAST Customer Support with the specific commands and output so they can best assist.

Basic Usage

To get capacity information for a specific directory:

# Basic capacity command
vastpy-cli get capacity path=/path/to/directory

Retrieving JSON Output

For more structured output and easier parsing, you can request JSON format:

# Get capacity information in JSON format
vastpy-cli get --json capacity path=/path/to/directory

Example with jq Parsing

You can pipe the JSON output to tools like jq for further filtering and analysis:

# Get available capacity keys
vastpy-cli get --json capacity path=/maria | jq '.keys'

Output:

[
  "usable",
  "unique",
  "logical"
]

To get detailed information about the directory:

# Get detailed capacity information for the directory
vastpy-cli get --json capacity path=/maria | jq '.details[0]'

Output:

[
  "/maria",
  {
    "data": [
      7546623507,
      5237582703,
      25589375620
    ],
    "parent": "/",
    "percent": 100,
    "average_atime": "2025-03-29 07:30"
  }
]

(The three values in the data array above are in bytes and correspond, in order, to the keys shown earlier: usable, unique, logical.)

Understanding Capacity Output

The capacity output contains three key metrics:

  1. Usable - The usable space in bytes on the physical media

  2. Unique - The unique space in bytes (after deduplication)

  3. Logical - The logical space in bytes (what users/applications see)

The average_atime field shows the average last access time for files in the directory, which can be helpful for understanding data usage patterns.
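As an example of putting these metrics to use, one way to approximate a data-reduction ratio for a directory is to divide logical bytes by usable bytes (this is an illustrative definition; how VAST formally computes its Data Reduction ratio may differ). A minimal sketch, with the echo'd JSON standing in for the real output of `vastpy-cli get --json capacity path=/maria`:

```shell
# Approximate data-reduction ratio = logical bytes / usable bytes.
# The sample JSON below is the /maria output from above; in practice,
# pipe the vastpy-cli command itself into jq instead of echo.
echo '{"details": [["/maria", {"data": [7546623507, 5237582703, 25589375620]}]]}' |
  jq '.details[0][1].data as $d | ($d[2] / $d[0] * 100 | round) / 100'
# → 3.39
```

The expression rounds to two decimal places; drop the `* 100 | round) / 100` portion if you want the raw ratio.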

2. Analyzing Data with VAST Catalog

The VAST Catalog provides a more powerful way to query and analyze your data, especially for large datasets. It is accessed via the bigcatalogconfig/query_data endpoint in vastpy-cli.

Some notes:

  1. Depending on the filter and the number of files on the system, these queries can take some time to complete. Most finish within seconds, but some may take longer.

  2. There are currently no built-in aggregation functions in this API. If you wish to determine the capacity of a folder, for example, you can retrieve the per-file sizes and sum them client-side (e.g., with jq, as shown below).

  3. It is typically recommended to use limit=50 when first experimenting with queries, so the output does not overrun your terminal. If you perform aggregations with jq, remove the limit, or your calculations will be incorrect.

Using vastpy-cli for Catalog Queries

First, an example that lists all files not accessed in the last 90 days. Note the trailing 000 in the filter value, which converts the epoch-seconds output of date into the milliseconds the API expects:

# Basic catalog query for files not accessed in the last 90 days
vastpy-cli post --json bigcatalogconfig/query_data \
  limit=50 \
  path="/vperfsanity" \
  fields='["parent_path", "name", "atime", "size"]' \
  filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq

Output (truncated):

{
  "count": 256,
  "next": null,
  "prop_list": [
    "parent_path",
    "name",
    "atime",
    "size"
  ],
  "results": [
    [
      "/vperfsanity/",
      "vperfsanity21",
      "2025-01-14T20:14:49.499298",
      4194304000
    ],
    [
      "/vperfsanity/",
      "vperfsanity155",
      "2025-01-14T20:14:47.415081",
      4194304000
    ]
  ]
}

The same query, but this time retrieving only the size field and summing it (notice that the limit has been removed so the sum is accurate):

vastpy-cli post --json bigcatalogconfig/query_data \
  path="/vperfsanity" \
  fields='["size"]' \
  filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq '([.results[][]] | add)'

Output:

1073741824000

You may want to convert from bytes into something more useful, such as GiB or TiB. For example, in GiB:

vastpy-cli post --json bigcatalogconfig/query_data \
  path="/vperfsanity" \
  fields='["size"]' \
  filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq '([.results[][]] | add) / (1024*1024*1024)'
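Because aggregation happens client-side, jq can also group results, for example to approximate capacity per file extension (one of the use cases mentioned earlier). A sketch of the jq logic, with the echo'd JSON standing in for the real output of a bigcatalogconfig/query_data query using fields='["name", "size"]' (and no limit):

```shell
# Sum bytes per file extension client-side. The sample JSON stands in for
# a real catalog response; pipe the vastpy-cli output into jq instead.
echo '{"results": [["a.mp4", 100], ["b.txt", 50], ["c.mp4", 25]]}' |
  jq -c '[.results[] | {ext: (.[0] | if contains(".") then split(".") | last else "(none)" end), size: .[1]}]
         | group_by(.ext)
         | map({(.[0].ext): (map(.size) | add)})
         | add'
# → {"mp4":125,"txt":50}
```

Files without a dot in their name are bucketed under "(none)"; adjust the extension-extraction step to suit your naming conventions.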

Advanced Query Examples

Finding Large Files

# Find files larger than 1GB
vastpy-cli post --json bigcatalogconfig/query_data \
  limit=50 \
  path="/vperfsanity" \
  fields='["parent_path", "name", "size"]' \
  filters='{"size": [{"gt": 1073741824}]}'

Finding Recently Modified Files

(Note the use of gt as opposed to lt)

# Find files modified in the last day
vastpy-cli post --json bigcatalogconfig/query_data \
  limit=50 \
  path="/" \
  fields='["parent_path", "name", "mtime"]' \
  filters='{"mtime": [{"gt": "'$(date -d '1 day ago' +%s)'000"}]}' | jq

Finding Files by Owner

# Find files owned by a specific user
vastpy-cli post --json bigcatalogconfig/query_data \
  limit=50 \
  path="/" \
  fields='["parent_path", "name", "login_name"]' \
  filters='{"login_name": "maria.gutierrez@selab.vastdata.com"}'

Finding Files by Filename

vastpy-cli post --json bigcatalogconfig/query_data \
  limit=50 \
  path="/" \
  fields='["parent_path", "name", "login_name"]' \
  filters='{"name": "comment.mp4"}'

Output:

{
  "count": 2,
  "next": null,
  "prop_list": [
    "parent_path",
    "name",
    "login_name"
  ],
  "results": [
    [
      "/scratch/home/rayc-flow/MusM/",
      "comment.mp4",
      null
    ],
    [
      "/scratch/home/jhays-flow/feasibleness/slim/",
      "comment.mp4",
      null
    ]
  ]
}

Tips for Large-Scale Analysis

When working with large datasets:

  1. Filter Early - Apply specific filters to reduce the dataset size

  2. Paginate Results - For large result sets, use pagination to process data in chunks

  3. Schedule Regular Analysis - Set up regular analysis jobs to track capacity trends over time
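The third tip can be as simple as a cron-driven script that appends a timestamped data point for later trend analysis. A minimal sketch (the CSV path is a placeholder, and TOTAL_BYTES stands in for the output of the summing query shown earlier):

```shell
# Append a timestamped capacity data point to a CSV for trend tracking.
# In a real script, TOTAL_BYTES would come from the vastpy-cli sum query,
# and this script would be invoked from cron (e.g., daily).
TOTAL_BYTES=1073741824000
echo "$(date +%Y-%m-%d),$TOTAL_BYTES" >> /tmp/vast-capacity-trend.csv
tail -n 1 /tmp/vast-capacity-trend.csv
```

Plotting the resulting CSV over time gives a simple view of capacity growth for the monitored path.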

Conclusion

The VAST capacity and catalog analysis tools provide powerful ways to understand your data usage patterns, identify optimization opportunities, and manage your storage more effectively. By combining direct capacity queries with catalog analysis, you can gain comprehensive insights into your VAST Data platform.

For more detailed information about the VAST API, please refer to your VAST system's documentation at https://<your-vast-vms-address>/docs.