Vector Search


VAST Databases support vector operations. This includes a vector data type (an array of floats) and functions that calculate and compare the distances between vectors and return the nearest neighbors of a given vector.

Vector operations are supported using the VAST Query Engine.

Note

This feature is introduced in version 5.4. In this version, vector searches use brute-force techniques that are not yet optimized.
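
To make the note concrete: a brute-force search compares the query vector with every stored vector and keeps the k closest. The following is a minimal pure-Python sketch of that idea, not the VAST implementation. The names `euclidean` and `brute_force_knn` and the sample rows are hypothetical and used only for illustration.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(rows, query, k):
    """Return the k (id, distance) pairs closest to the query.

    Every row is scanned once, so the cost grows linearly with the row count.
    """
    scored = ((row_id, euclidean(vec, query)) for row_id, vec in rows)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data: (id, vector) pairs.
rows = [
    (1, [1, 2, 3, 4, 5]),
    (2, [6, 7, 8, 9, 10]),
    (3, [11, 12, 13, 14, 15]),
]
nearest = brute_force_knn(rows, [1.5, 2.5, 3.5, 4.5, 5.5], k=2)
print(nearest)
```

Because no index is consulted, the scan time scales with the table size, so narrowing the candidate rows with a filter reduces the work directly.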

Vector Functions

VAST Databases support the following array (vector) distance functions: array_distance, array_cosine_distance, and array_negative_inner_product.
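
As a rough guide to what these functions compute, here is a pure-Python sketch. It assumes the functions follow the conventional definitions their names suggest: Euclidean distance for array_distance, one minus cosine similarity for array_cosine_distance, and the negated dot product for array_negative_inner_product. The Python helper names are hypothetical, and the exact numeric behavior of the VAST functions (for example with zero vectors) is not covered here.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Negated dot product, so that more similar vectors sort first."""
    return -sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0]
b = [3.0, 4.0]
print(l2_distance(a, b))
print(cosine_distance(a, b))
print(negative_inner_product(a, b))
```

Under these definitions, a smaller value always means a closer match, which is why an ascending ORDER BY on any of the three returns the nearest neighbors first.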

Filtering Vector Searches

You can filter vector searches using standard query filters (such as WHERE clauses). You can also filter rows based on VAST user permissions (see VAST Row and Column Security filters), so that a search returns only the rows (vectors) that the user running it is permitted to see.
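
The effect of combining a filter with a distance ordering can be sketched in plain Python: the predicate removes rows first, and only the remaining rows are ranked by distance. This models standard SQL semantics (WHERE, then ORDER BY, then LIMIT) and is not the VAST implementation. The function `filtered_search` is a hypothetical name, and the rows mirror the sample values used in the example on this page.

```python
import datetime as dt
import math

# Sample rows: id, vector, and timestamp.
rows = [
    {"id": 1, "vec": [1, 2, 3, 4, 5],
     "vec_timestamp": dt.datetime(2024, 4, 10, 12, 34)},
    {"id": 2, "vec": [6, 7, 8, 9, 10],
     "vec_timestamp": dt.datetime(2024, 4, 11, 12, 34)},
    {"id": 3, "vec": [11, 12, 13, 14, 15],
     "vec_timestamp": dt.datetime(2024, 4, 13, 12, 34)},
]


def filtered_search(rows, query, predicate, k):
    """Keep rows that satisfy the predicate, then return the k nearest."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: math.dist(row["vec"], query))
    return candidates[:k]


cutoff = dt.datetime(2024, 4, 11, 11, 30)
result = filtered_search(
    rows,
    [1.5, 2.5, 3.5, 4.5, 5.5],
    lambda row: row["vec_timestamp"] > cutoff,
    k=2,
)
print([row["id"] for row in result])
```

Row 1 is the closest vector to the query, but it is excluded because its timestamp falls before the cutoff. A filter can therefore change which neighbors are returned, not just how many.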

Example

This code snippet illustrates the use of distance functions.

import pyarrow as pa
import vastdb
import adbc_driver_manager.dbapi  # The dbapi submodule must be imported explicitly.
import datetime as dt

# Define parameters, for example:
# VASTDB_ENDPOINT = 'http://....'
# AWS_ACCESS_KEY_ID = 'AAAAAAAAAAAAAAAAAAAA'
# AWS_SECRET_ACCESS_KEY = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
# BUCKET_NAME = 'my-bucket' # Should already exist.
# SCHEMA_NAME =  'my-schema'
# TABLE_NAME = 'my-table'
# VAST_ADBC_DRIVER_PATH = '/tmp/libadbc_driver_vastdb.so'

# Create the table and insert data using the VastDB SDK.
# For more information, see https://github.com/vast-data/vastdb_sdk?tab=readme-ov-file
session = vastdb.connect(
    endpoint=VASTDB_ENDPOINT,
    access=AWS_ACCESS_KEY_ID,
    secret=AWS_SECRET_ACCESS_KEY)

with session.transaction() as tx:
    bucket = tx.bucket(BUCKET_NAME)

    # Create the schema in the bucket.
    schema = bucket.create_schema(SCHEMA_NAME)

    # Create the table.
    dimension = 5
    columns = pa.schema([("id", pa.int64()),
                         ("vec", pa.list_(pa.field(name="item", type=pa.float32(), nullable=False), dimension)),
                         ('vec_timestamp', pa.timestamp('us'))])
    table = schema.create_table(TABLE_NAME, columns)

    # Insert a few rows of data.
    arrow_table = pa.table(schema=columns, data=[
        [1, 2, 3],
        [[1,2,3,4,5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
        [dt.datetime(2024, 4, 10, 12, 34),
         dt.datetime(2024, 4, 11, 12, 34),
         dt.datetime(2024, 4, 13, 12, 34)]
    ])
    table.insert(arrow_table)

# Query the table using the ADBC driver.
with adbc_driver_manager.dbapi.connect(
    driver=VAST_ADBC_DRIVER_PATH,
    db_kwargs={
        "vast.db.endpoint": VASTDB_ENDPOINT,
        "vast.db.access_key": AWS_ACCESS_KEY_ID,
        "vast.db.secret_key": AWS_SECRET_ACCESS_KEY,
    },
) as connection:
    with connection.cursor() as cursor:
        full_table_name = f'"{BUCKET_NAME}/{SCHEMA_NAME}"."{TABLE_NAME}"'
        queries = [
            # Select all the rows.
            f"SELECT * FROM {full_table_name};",
            # Euclidean distance. The timestamp filter falls within the
            # range of the inserted rows, so the query returns results.
            f"SELECT * FROM {full_table_name} "
            f"WHERE vec_timestamp > '2024-04-11 11:30:00' ORDER BY "
            f"array_distance(vec, [1.5, 2.5, 3.5, 4.5, 5.5]::FLOAT[5]) LIMIT 2;",
            # Cosine distance.
            f"SELECT * FROM {full_table_name} "
            f"WHERE vec_timestamp > '2024-04-11 11:30:00' ORDER BY "
            f"array_cosine_distance(vec, [1.5, 2.5, 3.5, 4.5, 5.5]::FLOAT[5]) LIMIT 2;",
        ]
        for query in queries:
            cursor.execute(query)
            output_as_pandas_dataframe = cursor.fetch_arrow_table().to_pandas()
            print(f'{query=} {output_as_pandas_dataframe=}')