VAST Catalog Overview


VAST Catalog is a database that indexes the metadata attributes of all data on the cluster. It enables high-performance querying of files, objects and directories, classified by attributes that are used across multiple access protocols. Storage administrators can run queries through the VAST Web UI and through the VMS API. All users, including end users on the cluster's client network, can run queries through the VAST Catalog CLI or through connected third-party query engines.

In clusters with multiple tenants, queries are restricted to data for the tenant of the user making the query.

VAST Catalog builds its index of metadata attributes from periodic snapshots of the cluster's data. The database is stored in a dedicated S3 bucket on the cluster.

Usage Examples

VAST Catalog can be used to:

  • Find files, such as:

    • Finding all files in the /projects directory that are older than 90 days and larger than 10 GB.

    • Finding all files created by a specific user within the last week.

    • Finding all S3 objects that have the tag processed with the value false.

  • Report capacity and usage, such as:

    • Ranking the users that consume the most capacity in specific folders or projects.

    • Ranking capacity usage by file extension.
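Two of the examples above can be sketched as SQL. The sketch below is only a local illustration: it uses Python's built-in sqlite3 module as a stand-in for a real query engine, the table name catalog is hypothetical, the rows are invented sample data, and "older than 90 days" is interpreted here as an mtime more than 90 days in the past. Only the column names (parent_path, name, extension, element_type, owner_name, size, used, mtime) are taken from the VAST Catalog schema.

```python
import sqlite3
from datetime import datetime, timedelta

GB = 10 ** 9
now = datetime(2024, 6, 1)  # fixed "current time" so the example is repeatable


def days_ago(days):
    """Return a timestamp string for a point `days` before `now`."""
    return (now - timedelta(days=days)).isoformat(sep=" ")


# In-memory stand-in table; "catalog" is a hypothetical name, not a VAST name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog ("
    " parent_path TEXT, name TEXT, extension TEXT, element_type TEXT,"
    " owner_name TEXT, size INTEGER, used INTEGER, mtime TEXT)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # Invented sample rows, for illustration only.
        ("/projects/a/", "old_big.h5", "h5", "FILE", "alice", 12 * GB, 12 * GB, days_ago(200)),
        ("/projects/a/", "new_big.h5", "h5", "FILE", "alice", 20 * GB, 20 * GB, days_ago(3)),
        ("/projects/b/", "old_small.csv", "csv", "FILE", "bob", 1 * GB, 1 * GB, days_ago(400)),
        ("/scratch/", "old_big.tar", "tar", "FILE", "bob", 50 * GB, 50 * GB, days_ago(120)),
    ],
)

# Files under /projects that are older than 90 days and larger than 10 GB.
old_large = conn.execute(
    "SELECT parent_path || name FROM catalog"
    " WHERE element_type = 'FILE' AND parent_path LIKE '/projects/%'"
    " AND size > ? AND mtime < ? ORDER BY 1",
    (10 * GB, days_ago(90)),
).fetchall()

# Capacity usage ranked by file extension, largest first.
by_extension = conn.execute(
    "SELECT extension, SUM(used) AS total_used FROM catalog"
    " WHERE element_type = 'FILE'"
    " GROUP BY extension ORDER BY total_used DESC"
).fetchall()

print(old_large)
print(by_extension)
```

Against a real deployment, the SQL text would be adapted to the dialect and table naming of whichever query engine is connected to VAST Catalog.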

Tools for Querying the VAST Catalog

The following tools are available for querying VAST Catalog:

  • The VAST Catalog page in the VAST Web UI. This page provides a graphical user interface for building and running queries against VAST Catalog. The tool returns results in seconds and displays them in a customizable table. You can choose which of VAST Catalog's indexed columns to display in the query results. You can also export the results to CSV format for further processing.

  • The VAST Catalog Command Line Interface (CLI). This interface brings the same fast and powerful query capabilities to clients on the enterprise network. It runs in a standard Unix shell, so you can pipe its output to other Unix tools. For details, see the VAST Database CLI Quick Start Guide.

  • The Trino and Spark open source query engines. You can expose VAST Catalog to these engines using a downloadable storage connector.

  • The VAST Database SDK. This SDK is a Python-based API designed for interacting with the VAST Database and the VAST Catalog. You can use it for operations such as schema and table management, data querying, and transaction handling. It includes libraries for HTTP requests, pyarrow for handling Apache Arrow data formats, and flatbuffers for efficient serialization of data structures.
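As an illustration of the "further processing" that a CSV export allows, the following minimal sketch ranks owners by the bytes they use on disk. It assumes that the exported file has a header row whose column names match the Catalog column names (owner_name and used here). That layout is an assumption, and the sample data is invented.

```python
import csv
import io
from collections import defaultdict


def rank_owners_by_used(csv_file):
    """Sum the 'used' column per 'owner_name' and sort, largest first."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["owner_name"]] += int(row["used"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented sample standing in for an exported results file.
sample_export = io.StringIO(
    "name,owner_name,used\n"
    "a.h5,alice,4096\n"
    "b.h5,bob,1024\n"
    "c.h5,alice,2048\n"
)

ranking = rank_owners_by_used(sample_export)
print(ranking)
```

To process a real export, pass an open file object instead of the in-memory sample, for example `open("export.csv", newline="")`, where export.csv is a placeholder file name.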

Optimizing VAST Catalog Queries

You can optimize queries of the VAST Catalog using search_path, a virtual column that restricts a query to a specified subtree of the Catalog.

For example, the query

SELECT … WHERE search_path='/some/path/'

queries the Catalog, but only within the subtree /some/path. The issuer of the query must have permission to read the subtree (granted by an Identity Policy).
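When queries are generated by a script, the search_path condition can be added consistently by a small helper. The sketch below only builds the SQL text. The function name build_catalog_query and the default table name catalog are hypothetical, and the real table name depends on the query engine in use.

```python
def build_catalog_query(columns, search_path, where=None, table="catalog"):
    """Build a SELECT statement that is restricted to one Catalog subtree.

    `table` defaults to a hypothetical name; substitute the name that your
    query engine uses for the VAST Catalog table.
    """
    # Double any single quote so the path remains a valid SQL string literal.
    path_literal = search_path.replace("'", "''")
    clauses = [f"search_path = '{path_literal}'"]
    if where:
        # Parenthesize the extra condition so that any OR inside it cannot
        # escape the subtree restriction.
        clauses.append(f"({where})")
    return (
        f"SELECT {', '.join(columns)} FROM {table} "
        f"WHERE {' AND '.join(clauses)}"
    )


query = build_catalog_query(["name", "size"], "/some/path/", "size > 1000")
print(query)
```

Placing the subtree restriction in every generated query keeps each query confined to the subtree that the issuer is permitted to read.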

VAST Catalog Schema

VAST Catalog indexes the following attributes as columns:

| Attribute/Column | Type | Description |
|---|---|---|
| creation_time | timestamp | The date and time of element creation |
| uid | integer | Owner's POSIX UID |
| owner_sid | varchar | Owner's SMB SID |
| owner_name | varchar | Owner's user name |
| gid | integer | Owner group's POSIX GID |
| group_owner_sid | integer | Owner group's SMB SID |
| group_owner_name | varchar | Owner group's name |
| atime | timestamp | Time of last file access |
| mtime | timestamp | Time of last modification |
| ctime | timestamp | Time of last file system metadata change |
| nlinks | integer | Number of associated hard links |
| element_type | varchar | Type of element, such as FILE (file), DIR (directory) |
| size | integer | The size of the element |
| used | integer | Number of bytes used on disk |
| name | varchar | Element name |
| extension | varchar | File extension |
| parent_path | varchar | Parent path of file |
| symlink_path | varchar | The path of the link, if the element is a symbolic link |
| major_device | integer | Major device number |
| minor_device | integer | Minor device number |
| nfs_mode_bits | integer | POSIX permissions mode bits |
| name_aces_exist | boolean | Indicates whether or not extended ACEs (NTFS/NFS4/POSIX) exist on this element |
| s3_locks_legal_hold | boolean | Indicates whether there is an S3 object locking legal hold on this object |
| user_tags_count | integer | The number of S3 tags an element has |
| user_tags | map() | S3 tags associated with the element |
| user_metadata | map() | S3 metadata items on object (x-amz-meta) |
| login_name | varchar | The login name of the element owner |

It is also possible to add user-defined S3 tags and S3 metadata as additional columns.
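Once rows have been fetched by a client, some columns are easier to read after light post-processing. The sketch below is a minimal illustration that assumes each row arrives as a Python dict keyed by column name, and that a map() column arrives either as a dict or as a list of key/value pairs. Both are assumptions about the client, not documented behavior. The helper names permission_string and has_tag are hypothetical.

```python
import stat


def permission_string(nfs_mode_bits):
    """Render POSIX permission bits in the familiar 'rwxr-xr-x' form."""
    # S_IMODE keeps only the permission bits. filemode() prefixes a
    # file-type character, which is dropped here.
    return stat.filemode(stat.S_IMODE(nfs_mode_bits))[1:]


def has_tag(row, key, value):
    """Return True if the row's user_tags map holds `key` with `value`."""
    # dict() accepts both a mapping and a list of (key, value) pairs.
    tags = dict(row.get("user_tags") or {})
    return tags.get(key) == value


# Invented sample row, for illustration only.
row = {
    "name": "report.parquet",
    "element_type": "FILE",
    "nfs_mode_bits": 0o644,
    "user_tags": [("processed", "false")],
}

print(permission_string(row["nfs_mode_bits"]))
print(has_tag(row, "processed", "false"))
```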