VAST Catalog Overview

VAST Catalog is a database that indexes the metadata attributes of all data on the cluster. It enables high-performance querying of files, objects, and directories by attributes used across multiple access protocols. Storage administrators can run queries through the VAST Web UI and through the VMS API. All users, including end users on the cluster's client network, can run queries through the VAST Catalog CLI or through connected third-party query engines.

In clusters with multiple tenants, queries are restricted to data for the tenant of the user making the query.

VAST Catalog builds its index of metadata attributes from periodic snapshots of the cluster's data. The database is stored in a dedicated S3 bucket on the cluster.

Usage Examples

VAST Catalog can be used to:

Find files, for example:
- All files in the /projects directory that are older than 90 days and larger than 10GB.
- All files created since last week by a specific user.
- All S3 objects that have the tag processed with the value false.
Report capacity and usage, for example:
- Ranking the users that consume the most capacity in specific folders or projects.
- Ranking capacity usage by file extension.
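
The kinds of queries listed above can be tried out locally before running them against a real cluster. The following is a minimal sketch that uses an in-memory SQLite table as a stand-in for the Catalog; it is not the VAST API. The table name `catalog`, the sample rows, the trailing slash on `parent_path`, the use of `creation_time` to mean "older than", and the decimal interpretation of 10GB are all assumptions made for illustration. Only the column names come from the Catalog schema.

```python
import sqlite3
from datetime import datetime, timedelta

# Local stand-in for the Catalog (NOT the VAST API): an in-memory SQLite
# table that borrows a few documented column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (parent_path TEXT, name TEXT, extension TEXT, "
    "owner_name TEXT, element_type TEXT, size INTEGER, used INTEGER, "
    "creation_time TEXT)"
)

NOW = datetime(2024, 6, 1)  # fixed reference time keeps the example reproducible
GB = 10**9                  # assumption: decimal gigabytes


def ts(days_ago):
    """Format the timestamp `days_ago` days before NOW as sortable text."""
    return (NOW - timedelta(days=days_ago)).isoformat(sep=" ")


# Made-up sample rows; the trailing slash on parent_path is an assumption.
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("/projects/sim/", "run1.h5", "h5", "alice", "FILE", 20 * GB, 20 * GB, ts(200)),
        ("/projects/sim/", "run2.h5", "h5", "bob", "FILE", 5 * GB, 5 * GB, ts(200)),
        ("/projects/doc/", "notes.txt", "txt", "alice", "FILE", 1000, 4096, ts(3)),
        ("/home/bob/", "dump.bin", "bin", "bob", "FILE", 30 * GB, 30 * GB, ts(400)),
    ],
)

# Files under /projects that are older than 90 days and larger than 10GB.
old_large = conn.execute(
    "SELECT parent_path || name FROM catalog "
    "WHERE element_type = 'FILE' AND parent_path LIKE '/projects/%' "
    "AND size > ? AND creation_time < ?",
    (10 * GB, ts(90)),
).fetchall()

# Users ranked by the capacity they consume under /projects.
by_user = conn.execute(
    "SELECT owner_name, SUM(used) AS total_used FROM catalog "
    "WHERE parent_path LIKE '/projects/%' "
    "GROUP BY owner_name ORDER BY total_used DESC"
).fetchall()

# Capacity usage ranked by file extension across the whole table.
by_ext = conn.execute(
    "SELECT extension, SUM(used) AS total_used FROM catalog "
    "GROUP BY extension ORDER BY total_used DESC"
).fetchall()

print(old_large)
print(by_user)
print(by_ext)
```

The SQL dialect here is SQLite's. Queries sent through a connected query engine would follow that engine's syntax, particularly for timestamps and string functions.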

Tools for Querying the VAST Catalog

The following tools are available for querying VAST Catalog:

The VAST Catalog page in the VAST Web UI. This page provides a graphic user interface for easy building and execution of queries against VAST Catalog. The tool returns results in seconds and displays them in a customizable table of columns. You can choose to display any selection of VAST Catalog's indexed columns for the query results. You can also export the results to CSV format for further processing.
The VAST Catalog Command Line Interface (CLI). This interface brings the same fast and powerful query capabilities to clients on the enterprise network. It runs in a standard Unix shell, so you can pipe its output to other Unix tools. For details, see the VAST Database CLI Quick Start Guide.
The Trino and Spark open source query engines. You can expose the VAST Catalog to these engines using a downloadable storage connector.
The VAST Database SDK. This SDK is a Python-based API designed for interacting with the VAST Database and the VAST Catalog. You can use it for operations such as schema and table management, data querying, and transaction handling. It includes libraries for HTTP requests, pyarrow for handling Apache Arrow data formats, and flatbuffers for efficient serialization of data structures.

Optimizing VAST Catalog Queries

You can optimize queries of the VAST Catalog using search_path. This is a virtual column that specifies the subtree of the Catalog to which the query is restricted.

For example, the query

SELECT … WHERE search_path='/some/path/'

queries the Catalog, but restricts the query to the subtree /some/path. The issuer of the query must have permission to read the subtree, as granted by an identity policy.
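
When queries are generated by a script, the search_path restriction can be added in one place. The following is a minimal sketch of such a helper. The function name `build_catalog_query` and the table name `catalog_table` are hypothetical; substitute the table name exposed by your query engine or connector. The helper only builds a SQL string and does not contact the cluster.

```python
def build_catalog_query(table, columns, search_path, where=None):
    """Return a SQL string restricted to one Catalog subtree via search_path.

    `table` is whatever name your query engine exposes for the Catalog
    (hypothetical here), and `where` is an optional extra predicate.
    """
    if not search_path.startswith("/"):
        raise ValueError("search_path must be an absolute path")
    # Double any single quotes so the path is a valid SQL string literal.
    literal = search_path.replace("'", "''")
    predicates = [f"search_path = '{literal}'"]
    if where:
        predicates.append(f"({where})")
    return (
        f"SELECT {', '.join(columns)} FROM {table} "
        f"WHERE {' AND '.join(predicates)}"
    )


# Example: sizes of large elements under /some/path/ only.
query = build_catalog_query(
    table="catalog_table",          # hypothetical table name
    columns=["name", "size"],
    search_path="/some/path/",
    where="size > 1000",
)
print(query)
```

Keeping the search_path predicate first and wrapping any extra condition in parentheses means an `OR` inside the extra condition cannot widen the query beyond the chosen subtree.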

VAST Catalog Schema

VAST Catalog indexes the following attributes as columns:

Attribute/Column	Type	Description
`creation_time`	timestamp	The date and time of element creation
`uid`	integer	Owner's POSIX UID
`owner_sid`	varchar	Owner's SMB SID
`owner_name`	varchar	Owner's user name
`gid`	integer	Owner group's POSIX GID
`group_owner_sid`	varchar	Owner group's SMB SID
`group_owner_name`	varchar	Owner group's name
`atime`	timestamp	Time of last file access
`mtime`	timestamp	Time of last modification
`ctime`	timestamp	Time of last file system metadata change
`nlinks`	integer	Number of associated hard links
`element_type`	varchar	Type of element, such as FILE (file), DIR (directory)
`size`	integer	The size of the element
`used`	integer	Number of bytes used on disk
`name`	varchar	Element name
`extension`	varchar	File extension
`parent_path`	varchar	Parent path of file
`symlink_path`	varchar	The path of the link, if the element is a symbolic link
`major_device`	integer	Major device number
`minor_device`	integer	Minor device number
`nfs_mode_bits`	integer	POSIX permissions mode bits
`name_aces_exist`	boolean	Indicates whether or not extended ACEs (NTFS/NFS4/POSIX) exist on this element
`s3_locks_legal_hold`	boolean	Indicates whether there is an S3 object locking legal hold on this object
`user_tags_count`	integer	The number of S3 tags an element has
`user_tags`	map()	S3 tags associated with the element
`user_metadata`	map()	S3 metadata items on object (x-amz-meta)
`login_name`	varchar	The login name of the element owner

It is also possible to add user-defined S3 tags and S3 metadata as additional columns.
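
Some of the integer columns are easier to read after decoding. As an illustration, the sketch below turns a `nfs_mode_bits` value into the familiar octal and `rwx` forms using Python's standard `stat` module. It assumes the column holds standard POSIX mode bits as an integer; the helper name `describe_mode` is hypothetical and is not part of any VAST tool.

```python
import stat


def describe_mode(nfs_mode_bits):
    """Return (octal string, rwx string) for a POSIX mode value.

    Assumes nfs_mode_bits holds standard POSIX mode bits. Any file-type
    bits are masked off so that only the permission bits are described.
    """
    perms = stat.S_IMODE(nfs_mode_bits)
    # stat.filemode() prefixes a file-type character; drop it.
    return f"{perms:04o}", stat.filemode(perms)[1:]


print(describe_mode(0o100644))
print(describe_mode(0o755))
```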