Summary
This document explains how to analyze storage capacity and file data on a VAST Data platform using vastpy-cli. We cover both direct capacity queries and the more advanced catalog-based analysis.
Methods
Via VAST REST API
We will cover two methods of obtaining this data. The first, using the VAST Capacity Estimations data, leverages the VAST REST API (via the vastpy-cli utility) to query the capacity statistics for any directory within the VAST Platform. This API is accessible on all versions of VAST from 4.0 onwards, and does not require any configuration. For more information on this data, visit Capacity Overview.
This approach is useful if:
The VAST Catalog is not enabled on your cluster
You want insight into the data reduction ratio of different folders on the system.
Via VAST Catalog
The second approach is to use vastpy-cli to query the VAST Catalog, a feature that has been available since the 4.6 release. This does require that the VAST Catalog is enabled; see VAST Catalog Overview for more details on the feature and how to configure it. This approach is preferred if:
The VAST Catalog is enabled
You want more granular insight and filtering capabilities. Examples include:
Determine capacity used per user / group
Determine capacity per file extension
Find individual file locations based on a filename search
Much, much more.
Regardless of which of these approaches you choose, you will be using the same basic tool (vastpy-cli). There are other methods of accessing this data, which will be covered in another document.
Prerequisites
A host with:
Python installed
Network access to the VMS IP address, specifically on port 443 (HTTPS)
VMS Manager account with appropriate permissions
Setting Up vastpy-cli
Installing vastpy
Install the vastpy package, which includes the vastpy-cli command:
pip install --upgrade vastpy
You may also want to install the jq utility:
sudo apt-get install -y jq #ubuntu/debian variants
sudo yum install -y jq #rocky/rhel variants
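A quick way to confirm jq is installed and working before moving on:

```shell
# Sanity-check the jq installation by summing a small JSON array
echo '[1, 2, 3]' | jq 'add'
# → 6
```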
Setting Up Authentication
Set up environment variables for authentication (recommended method):
# Set these environment variables once per session
export VMS_USER=admin
export VMS_PASSWORD=your_password # Add a space before this command to avoid storing the password in command history
export VMS_ADDRESS=vast-vms-address
Alternatively, you can specify credentials with each command:
vastpy-cli --user=admin --password=your_password --address=vast-vms-address [command]
1. Analyzing Directory Capacity with vastpy-cli
The capacity subcommand in vastpy-cli allows you to retrieve detailed capacity information for specific directories in your VAST Data platform.
Generally, filtering is done by specifying the path you wish to analyze; it can be a top-level directory or any subdirectory within the system. A few caveats:
The data that underpins this API is an estimation based on sampling and should not be used for exact calculations.
There may be a delay between the time data is ingested into a folder and the time it appears in the statistics, because sampling can occur as data is migrated asynchronously from SCM to QLC flash.
Folders that do not have enough aggregate capacity will not appear on their own in these metrics.
If you are unsure of the information you are seeing, contact CS with the specific commands and output so they can best assist.
Basic Usage
To get capacity information for a specific directory:
# Basic capacity command
vastpy-cli get capacity path=/path/to/directory
Retrieving JSON Output
For more structured output and easier parsing, you can request JSON format:
# Get capacity information in JSON format
vastpy-cli get --json capacity path=/path/to/directory
Example with jq Parsing
You can pipe the JSON output to tools like jq for further filtering and analysis:
# Get available capacity keys
vastpy-cli get --json capacity path=/maria | jq '.keys'
Output:
[
"usable",
"unique",
"logical"
]
To get detailed information about the directory:
# Get detailed capacity information for the directory
vastpy-cli get --json capacity path=/maria | jq '.details[0]'
Output:
[
"/maria",
{
"data": [
7546623507,
5237582703,
25589375620
],
"parent": "/",
"percent": 100,
"average_atime": "2025-03-29 07:30"
}
]
(The output above is in bytes.)
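To make the byte values easier to read, you can convert them to GiB with jq. A local sketch using sample data shaped like the output above (the inline JSON mirrors the API response, trimmed to the fields used here):

```shell
# Sample shaped like the capacity API's details array (values from the /maria output above)
sample='{"details": [["/maria", {"data": [7546623507, 5237582703, 25589375620]}]]}'

# Convert each byte value to GiB, rounded to two decimal places
echo "$sample" | jq -c '.details[0][1].data | map((. / (1024*1024*1024) * 100 | round) / 100)'
# → [7.03,4.88,23.83]
```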
Understanding Capacity Output
The capacity output contains three key metrics:
Usable - The usable space in bytes on the physical media
Unique - The unique space in bytes (after deduplication)
Logical - The logical space in bytes (what users/applications see)
The average_atime field shows the average last access time for files in the directory, which can be helpful for understanding data usage patterns.
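Assuming the data array follows the key order shown earlier (usable, unique, logical), you can estimate a directory's data reduction ratio by dividing logical by usable. A sketch against the sample output above:

```shell
# Sample shaped like the capacity API output (values from /maria above)
sample='{"details": [["/maria", {"data": [7546623507, 5237582703, 25589375620]}]]}'

# Data reduction ratio = logical bytes / usable bytes, rounded to two decimals
echo "$sample" | jq '.details[0][1].data as $d | ($d[2] / $d[0] * 100 | round) / 100'
# → 3.39
```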
2. Analyzing Data with VAST Catalog
The VAST Catalog provides a more powerful way to query and analyze your data, especially for large datasets. You can query it with vastpy-cli via the bigcatalogconfig/query_data endpoint.
Some notes:
Depending on the filter and the number of files on the system, these queries can take some time to complete. Most should finish within seconds, but some may take longer.
There are no built-in 'aggregation' functions in this API at the moment. If you wish to determine the capacity of a folder, for example, you have the following options:
Pipe the output through jq (described in some of the examples below)
Use another method, such as the vastdb-sdk, as described here: https://vastdb-sdk.readthedocs.io/
Use an application such as Trino, as described here: https://vast-data.github.io/data-platform-field-docs/vast_database/trino/quickstart.html
It is typically recommended to use limit=50 when first experimenting with queries, so that the output does not overrun your terminal. If you choose to perform aggregations using jq, remove the limit, or your calculations will be incorrect.
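One detail worth calling out before the examples: the atime/mtime filters below take epoch milliseconds, which the shell snippets build by appending three zeros to date's epoch-seconds output. A quick local demonstration using a fixed date so the result is deterministic (assumes GNU date):

```shell
# Epoch seconds for a fixed date (GNU date syntax)
ts=$(date -u -d '2025-01-01 00:00:00 UTC' +%s)

echo "${ts}000"         # append "000": epoch milliseconds as a string
echo $(( ts * 1000 ))   # the same value, computed arithmetically
# → 1735689600000 (both lines)
```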
Using vastpy-cli for Catalog Queries
First, an example to list all files that were not accessed in the last 90 days:
# Basic catalog query for files not accessed in the last 90 days
vastpy-cli post --json bigcatalogconfig/query_data \
limit=50 \
path="/vperfsanity" \
fields='["parent_path", "name", "atime", "size"]' \
filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq
Output (truncated):
{
"count": 256,
"next": null,
"prop_list": [
"parent_path",
"name",
"atime",
"size"
],
"results": [
[
"/vperfsanity/",
"vperfsanity21",
"2025-01-14T20:14:49.499298",
4194304000
],
[
"/vperfsanity/",
"vperfsanity155",
"2025-01-14T20:14:47.415081",
4194304000
]
]
}
The same query, but this time retrieving only the size field and performing a sum (notice that the limit is removed from this example to get an accurate sum):
vastpy-cli post --json bigcatalogconfig/query_data \
path="/vperfsanity" \
fields='["size"]' \
filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq '([.results[][]] | add)'
Output:
1073741824000
You may want to convert from bytes into something more useful, like GiB or TiB:
vastpy-cli post --json bigcatalogconfig/query_data \
path="/vperfsanity" \
fields='["size"]' \
filters='{"atime": [{"lt": "'$(date -d '90 days ago' +%s)'000"}]}' | jq '([.results[][]] | add) / (1024*1024*1024)'
Advanced Query Examples
Finding Large Files
# Find files larger than 1GB
vastpy-cli post --json bigcatalogconfig/query_data \
limit=50 \
path="/vperfsanity" \
fields='["parent_path", "name", "size"]' \
filters='{"size": [{"gt": 1073741824}]}'
Finding Recently Modified Files
(Note the use of gt as opposed to lt)
# Find files modified in the last day
vastpy-cli post --json bigcatalogconfig/query_data \
limit=50 \
path="/" \
fields='["parent_path", "name", "mtime"]' \
filters='{"mtime": [{"gt": "'$(date -d '1 day ago' +%s)'000"}]}' | jq
Finding Files by Owner
# Find files owned by a specific user
vastpy-cli post --json bigcatalogconfig/query_data \
limit=50 \
path="/" \
fields='["parent_path", "name", "login_name"]' \
filters='{"login_name": "maria.gutierrez@selab.vastdata.com"}'
Finding Files by Filename
vastpy-cli post --json bigcatalogconfig/query_data \
limit=50 \
path="/" \
fields='["parent_path", "name", "login_name"]' \
filters='{"name": "comment.mp4"}'
Output:
{
"count": 2,
"next": null,
"prop_list": [
"parent_path",
"name",
"login_name"
],
"results": [
[
"/scratch/home/rayc-flow/MusM/",
"comment.mp4",
null
],
[
"/scratch/home/jhays-flow/feasibleness/slim/",
"comment.mp4",
null
]
]
}
Tips for Large-Scale Analysis
When working with large datasets:
Filter Early - Apply specific filters to reduce the dataset size
Paginate Results - For large result sets, use pagination to process data in chunks
Schedule Regular Analysis - Set up regular analysis jobs to track capacity trends over time
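For the last tip, a minimal sketch of a scheduled job (assumptions: a Linux host with cron, credentials sourced from a hypothetical /etc/vast-env file containing the exports shown earlier, and a hypothetical log path):

```shell
# Hypothetical crontab entry: record /maria capacity nightly at 01:00,
# appending one timestamped JSON line per run for later trend analysis
0 1 * * * . /etc/vast-env && vastpy-cli get --json capacity path=/maria | jq -c '{ts: (now | todate), details: .details[0]}' >> /var/log/vast-capacity.jsonl
```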
Conclusion
The VAST capacity and catalog analysis tools provide powerful ways to understand your data usage patterns, identify optimization opportunities, and manage your storage more effectively. By combining direct capacity queries with catalog analysis, you can gain comprehensive insights into your VAST Data platform.
For more detailed information about the VAST API, please refer to your VAST system's documentation at https://<your-vast-vms-address>/docs.