Trino & VAST Catalog Quick Start Guide

Prev Next

Trino provides a very powerful and flexible way to connect to the VAST Catalog. Trino enables querying the catalog from the Trino Python client, Querybook, Grafana, and other software that utilize Trino as a data source. VAST has published Docker containers with everything you need to connect to your VAST cluster.

Prerequisites

This guide assumes that you have:

  • Enabled the catalog

  • Have an identity policy granting access to it

  • Have a user associated with the identity policy

  • The user has S3 credentials

  • A Linux system with Docker installed and a user with permissions to start containers.

Refer to “Providing Client Access to VAST Catalog CLI” in the VAST Support documentation for details on the first four items.

Configure & start Trino container

Create a vast.properties with these contents, update with the correct endpoint and access/secret keys:

## Update these fields ##
# "endpoint" should be a load-balanced DNS entry or one of the VIPs prefixed by "http://"
# it should not contain a trailing / or anything else.
endpoint=http://x.x.x.x

# "data_endpoints" should be be a load-balanced DNS entry or one or more of the VIPs
# prefixed by "http://" it should not contain a trailing / or anything else.
# Multiple VIPs can be used with commas between them, eg: http://x.x.x.x,http://y.y.y.y
data_endpoints=http://x.x.x.x

# Access and secret keys -- make sure the user was added to an identity policy
# granting them access to the catalog.
access_key_id=xxx
secret_access_key=xxx

## Don't change these fields ##
connector.name=vast
region=us-east-1

num_of_splits=32
num_of_subsplits=8

vast.http-client.request-timeout=60m
vast.http-client.idle-timeout=60m

enable_custom_schema_separator=true
custom_schema_separator=|

Start the VAST-provided Trino Docker container – they already contain the connector and the correct jvm.config updates.

ℹ️ Info

The version of the Trino container depends on your cluster version. See the full docs (Installing and Configuring the VAST Connector for Trino) for details.

In short:

  • 4.7 - use vastdataorg/trino-vast:375

  • 5.0 - use vastdataorg/trino-vast:420

  • 5.1 - use vastdataorg/trino-vast:429

docker run \
    --name trino \
    -p 8080:8080 -d \
    -v ./vast.properties:/etc/trino/catalog/vast.properties:ro \
    vastdataorg/trino-vast:429

Test with the Trino client

Start the client from within the Trino container:

docker exec -it trino trino

Now you can execute queries against the server – you must start with the use command to set the context:

use vast."vast-big-catalog-bucket|vast_big_catalog_schema";
show columns from vast_big_catalog_table;
select * from vast_big_catalog_table limit 1;

Optionally install Java, download the client, and use the client directly:

sudo yum install java-openjdk

wget https://repo1.maven.org/maven2/io/trino/trino-cli/429/trino-cli-429-executable.jar
java -jar trino-cli-429-executable.jar

Connect with the trino Python library

You can also query the catalog with the trino python library, eg:

from trino.dbapi import connect

TRINO_HOST = "localhost"
TRINO_PORT = 8080

# establish our connection to Trino
conn = connect(
    host=TRINO_HOST,
    port=TRINO_PORT,
    user="vast",
    catalog="vast",
    schema="vast-big-catalog-bucket/vast_big_catalog_schema",
)

# find all the files to tag
cur = conn.cursor()
cur.execute(
    """
    SELECT *
    FROM vast_big_catalog_table
    LIMIT 1
    """
)
for row in cur.fetchall():
    print(row)

Useful starter queries

Describe the table to see all the columns available:

describe vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table

Example output:

       Column        |                  Type                   | Extra |
---------------------+-----------------------------------------+-------+
 phandle             | row(clone_id integer, handle_id bigint) |       |
 creation_time       | timestamp(9)                            |       |
 uid                 | integer                                 |       |
 owner_sid           | varchar                                 |       |
 owner_name          | varchar                                 |       |
 gid                 | integer                                 |       |
 group_owner_sid     | varchar                                 |       |
 group_owner_name    | varchar                                 |       |
 atime               | timestamp(9)                            |       |
 mtime               | timestamp(9)                            |       |
 ctime               | timestamp(9)                            |       |
 nlinks              | bigint                                  |       |
 element_type        | varchar                                 |       |
 size                | bigint                                  |       |
 used                | bigint                                  |       |
 tenant_id           | integer                                 |       |
 name                | varchar                                 |       |
 extension           | varchar                                 |       |
 parent_path         | varchar                                 |       |
 symlink_path        | varchar                                 |       |
 major_device        | integer                                 |       |
 minor_device        | integer                                 |       |
 s3_locks_retention  | row(mode tinyint, timeout time(6))      |       |
 nfs_mode_bits       | integer                                 |       |
 name_aces_exist     | boolean                                 |       |
 s3_locks_legal_hold | boolean                                 |       |
 user_tags_count     | smallint                                |       |
 user_metadata       | map(varchar, varchar)                   |       |
 user_tags           | map(varchar, varchar)                   |       |
 login_name          | varchar                                 |       |
 search_path         | varchar                                 |       |

Get the largest 10 files on the system last modified in the past 3 months:

select parent_path, name, size, mtime
from vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table
where mtime > (current_timestamp - interval '3' month)
order by size
desc limit 10

Get the largest 10 files on the system created more than 6 months ago:

select parent_path, name, size, ctime
from vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table
where ctime < (current_timestamp - interval '6' month)
order by size
desc limit 10

Show how much space is being used by the user, sorted by the space used:

select owner_name, sum(size) as size
from vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table
group by owner_name
order by size desc
limit 10

Show the 10 directories taking up the most space:

select parent_path, sum(size) as size
from vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table
group by parent_path
order by size desc
limit 10

Query for a specific tag (below as tagname) by value (below as tagvalue):

select parent_path, name, size
from vast."vast-big-catalog-bucket/vast_big_catalog_schema".vast_big_catalog_table
where user_tags_count > 0 and contains(map_keys(user_tags), 'tagname') and user_tags['tagname'] = 'tagvalue'
order by size desc
limit 10

Stop the Trino server

To stop the Trino container after you’re done:

docker stop trino
docker rm trino

Troubleshooting

If you encounter the error Vast Server Error: Failed starting transaction, something is wrong with your access keys, or the user with those access keys has not been granted access to the catalog.