VAST Probe - Manual Execution


Summary

The VAST Probe is a long-running process in a Docker container. The Docker container needs to run on a Linux system that has read-only access to the files you want to examine for data reduction, as well as reasonable memory and a substantial, fast local disk.

When you have issues with the probe_launcher.py script or need more experimental features, use this page.

 

Quick Start

If you already know all the details and just want to run the probe, go to the VAST Probe Quickstart. This document remains available for cases where the quick start isn't appropriate, but the quick start approach is much preferred and should be used in all cases except where it doesn't work.

Prerequisites

Note: for hardware requirements, please see VAST Probe Requirements

The probe requires only a few things in order to function:

  • a functioning Docker environment

  • SELinux is disabled (SELinux will prevent root access to Docker mountpoints: local disk and NFS)

  • a large scratch area that the probe will use for storing logs and information about the data it is scanning. This should be about 0.25% of the size of the storage being scanned (details below). By convention this is /mnt/probe, but it can be any path.

  • file systems to scan. These can be local or, more typically, NFS-mounted; it does not matter to the probe. The probe requires only read access and does not modify any data on the file system. Root access is recommended to ensure all files can be read. Note that Lustre/GPFS can also work if they are mounted on the host.
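For example, if the data to be scanned lives on an NFS server, a minimal sketch of a read-only mount on the probe host (the server name and export path are placeholders):

sudo mount -t nfs -o ro fileserver1:/export/data /mnt/fileserver1

For the scratch area, the 0.25% guideline means that scanning roughly 400 TB of data calls for about 1 TB of fast local disk at /mnt/probe.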

At the end of the scanning process, it reports its findings. If email is enabled, the probe will report intermediate and final findings to VAST via email. If email is not enabled, all reporting is done via log files stored on the system running the probe.

 

How long the probe takes to run depends on the amount of data you point it at. It's important to point the probe at enough data to get a representative mix of data sets, and to give the probe a good opportunity to find similarity across files and thus better data reduction. For example, if you have a particular data type (say, genomic), scan many genomic files, not just one.

 

How fast the probe runs depends on many variables, including network speed, the size of the server running the probe, the performance of the disk the probe writes to (SSD is important), and the amount of similarity we find. 

 

There is now a separate probe_launcher script, which simplifies operation.  Instructions are in the VAST Probe Quickstart article.  If you don't have python3 or would prefer the manual method, please see below.

  

Manual Probe Method

When the probe Docker container is launched, you can connect to it and run the probe with key configuration information. The probe will then run until completion and report results. 

  1. Get Docker container image

HTTPS Download

 

Refer to the VAST Probe Quickstart for the latest probe version and its build information.

e.g.:

export PROBE_BUILD=GET THIS NUMBER FROM THE QUICKSTART ARTICLE
  2. **NEW STEP**: untar it

tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz
  3. load it

docker load -i ${PROBE_BUILD}.probe.image.gz

(This will take a few minutes without meaningful output; be patient)

Tag the loaded image: run docker images and note the new image and its ID. Recall that images are identified by unique image IDs and human-readable tags. Tag it as shown below, using the ID from the docker images output; by convention, the tag name is the probe build.

docker images

...notice the Image IDs in the output list ....

 

docker tag <image ID from the docker images output> vast-probe-${PROBE_BUILD}
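For example, if docker images shows the newly loaded image with an (entirely hypothetical) ID of 3f2a1b4c5d6e, the tag command would be:

docker tag 3f2a1b4c5d6e vast-probe-${PROBE_BUILD}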
  1. Launch a 'screen' session (or tmux). We recommend some kind of long-lived session tool since the probe can take a very long time to run, and we do not want it to terminate if there is an issue with the client system.

    screen -R probe
  2. Run the container while mapping the required directories 

    1. Run with the image tag/name you set earlier. Each -v specifies a mount from the host operating system that should be made available inside Docker; these are directories that the probe can use and scan. Include as many -v's as needed, just ensure that at least one is the actual probe scratch directory (/mnt/probe in this case).

      docker run -v /mnt/fileserver1:/mnt/fileserver1 -v /mnt/probe:/mnt/probe -it vast-probe-${PROBE_BUILD}
  3. From within the Docker container, create the relevant output directories, e.g.:

    sudo mkdir -p /mnt/probe/vast-probe/output
    sudo mkdir -p /mnt/probe/vast-probe/db
    sudo chmod -R 777 /mnt/probe/vast-probe
    #note: If you get permission denied then disable selinux on your host
  4. Edit probe config file:

    vim /vast/install/probe/sim_init_file.yml

     

  5. See example config below, but also some items to note:

    1. input_dir: You can specify more than one; add each additional directory on a new line prefixed with '-'. This allows the probe to scan multiple mount points and file systems. Each input directory is scanned in a parallel thread, which can slightly improve probe scan times.

    2. output_dir: This is where the summary and some stats files will go. This is relatively small (under 1 GB, although it could get larger if you are scanning a lot of paths).

    3. metadata_dir: If using disk-based indexes, the space here needs to be pretty large (1% of the total dataset is very safe).

    4. match_disable: If you set it to '1', it will do 'local-only' compression/dedup.  This completes much more quickly but does not perform any similarity hashing.

    5. max_number_of_files: This effectively pre-allocates some RAM to hold file pointers. Set this value somewhat higher than the total number of files you expect the probe to scan. Every 1 million files takes up 50 MB of RAM; 1 billion is 50 GB. Make sure not to set a value that causes the file pointer cache to exceed 50% of system RAM.

    6. disk_size_gb: Set this to use a disk-based index. If you set it to 0, the probe will instead use a RAM-based index (see the next variable). The index is ~80% of the probe metadata, so if you have a dedicated SSD-based file system for probe metadata, a rule of thumb is to set this to 80% of that disk's size. Remember that the free disk space needs to be 0.6% of the total dataset size (this includes a safety margin). A worked sizing example follows this list.

    7. ram_size_gb: If the disk index is not used, the probe will use RAM for indexing. This is faster but may produce inaccurate results for large data sets. If this value is left unset, the probe will use 80% of the available system memory.

    8. IOPS_limit: Can be used to limit the read rate from the target system. The IO size is the chunk size (default 32K), e.g., IOPS_limit: 1000 → ~32 MB/sec.

    9. sample config:

      input_dir:
        - '/mnt/fileserver/data/stuff'
      filter: '*'
      output_dir: '/mnt/probe/vast-probe/output' # dir for log files
      metadata_dir: '/mnt/probe/vast-probe/db' # dir for probe metadata
      
      regexp_filter: '' # files/directories matching the filter will NOT be scanned by the probe
       
      send_from: 'andy@vastdata.com'
      send_to:
       - 'andy@vastdata.com'
       - 'probe.callhome@vastdata.com'
       
      remote_monitoring_freq: 100 # sending mail with stats line, every remote_monitoring_freq seconds
      SMTP_host: 'localhost' # put an SMTP relay here
       
      remove_db_dir: 0 #remove db dir after each run? 1 for yes, 0 for no
      ignore_links: 1 #1 for yes, 0 for no
       
      IOPS_limit: 0 #for no limit, put 0
      number_of_threads: 0 # for one thread per core, put 0
      printing_frequency: 1 # in seconds
      open_files_limit: 0 #for no limit, put 0
      obfuscate_files_names: 0 #1 for yes, 0 for no
      match_disable: 0 #1 to disable matches, 0 to enable
      ram_size_gb: 0 # RAM for indexes (in GB), 0 will make the probe use ~80% of the available system memory
      disk_size_gb: 100 # if set will use disk to store the similarity index.
       
      pause: #'7:15' #hh:mm or leave blank
      resume: #'17:16' #hh:mm or leave blank
       
      split:
      ...
  6. Once you are satisfied, copy the .yml file somewhere outside of the container (to an NFS mount or via SCP), since it will not survive a container restart:

    cp /vast/install/probe/sim_init_file.yml /mnt/probe/
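As a worked sizing example (all numbers hypothetical): suppose you expect to scan roughly 200 TB of data containing about 500 million files, and /mnt/probe is a dedicated 1.2 TB SSD file system for probe metadata. Reasonable settings would be:

      max_number_of_files: 600000000 # hypothetical: ~500M expected files plus headroom; 600 x 50 MB per million files = ~30 GB of RAM, which must stay under 50% of system RAM
      disk_size_gb: 960 # free scratch space should be ~0.6% of 200 TB = ~1.2 TB; the index gets ~80% of that dedicated SSD
      IOPS_limit: 3200 # optional throttle: 3200 x 32 KB chunks = ~100 MB/sec read from the target system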

 

Launch The VAST Probe

  1. While still connected to the probe's Docker container, go to the probe's home directory.
    Note: if you need to run the probe a second time, you can copy the saved sim_init_file.yml file from /mnt/probe back into the container at /vast/install/probe (a copy command sketch follows this list).

     

    1. cd /vast/install/probe
  2. Run it:

    1. sudo is required if root is needed in order to access one of the directories configured in the init file. 

      sudo python36 ./probe.py
  3. Wait.....
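If this is a repeat run and you saved the config outside the container as noted above, a minimal sketch of restoring it before launching (using the paths from this example):

    cp /mnt/probe/sim_init_file.yml /vast/install/probe/sim_init_file.yml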

 

Probe Stages

Some of these stages run concurrently (e.g., Treewalk can run in the background throughout).

Treewalk Phase

When the probe is first kicked off, it builds a list of all files, along with the size-in-bytes for each file.  This process has recently been parallelized to use more threads to perform this tree walk; however, depending on the source filesystem, this may still take a significant amount of time.   Note that this runs in the background, so the probe can make progress on other stages even while the treewalk phase is active.

As an alternative, you can specify the '--csv' option to point to a CSV file which looks like this:

/path/to/file.file,1234

(where 1234 = sizeInBytes)
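A minimal sketch of generating such a CSV with GNU find (the exact --csv invocation is illustrative and may differ between probe builds; the output path is a placeholder):

find /mnt/fileserver1 -type f -printf '%p,%s\n' > /mnt/probe/filelist.csv
sudo python36 ./probe.py --csv /mnt/probe/filelist.csv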

 

DB Initialization Phase

The probe needs to initialize the Dictionary/Database, which is used for storing matches.  Depending on the speed of the storage that hosts the database (specified via 'metadata_dir'), this can take some time.  Also note that the 'disk_size_gb' parameter directly determines how large the DB will be.  During this phase, the probe will pre-allocate the DB by writing that many GB to the metadata_dir.
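As a rough, hypothetical illustration of the pre-allocation time: writing a 960 GB DB to an SSD that sustains ~500 MB/sec works out to 960,000 MB / 500 MB/sec ≈ 1,920 seconds, or about 32 minutes.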

DataScan Phase

Once initialization has occurred, this is when the actual probe-scanning happens.  During this time, multiple threads are walking through the generated list of files and reading them to generate the various hashes, which are then inserted into the DB.  You can monitor progress during this phase as described below.

 

Monitoring progress

 

While the probe is running, there are two ways to monitor progress:

  1.  Watch the screen

    1. sudo python3 ./probe.py
      mail sending off
      Scanning input directories, this might take a while...
      Scanned 144932 files, size 3.2TB
      File scan completed
      open file limit is 65536, it is recommended to allow as many open files as possible
      Initializing probe.
      336.386 GB/3.1718 TB (10.4%)     process_rate = 850.45 MB/sec   factor = 1.54
  2. Check the log file

    1. tail

      tail -f /mnt/probe/vast-probe/output/probe_Mon_Jan_21_11_57_52_2019.log
    2. This will give you info like this:

      n_chunks = 482991, n_matched_chunks = 392628, dedups = 918, match_percent = 81.291% , sum_of_gain = 403245081, gain = 64.877, avg_gain_per_match = 1027.04, avg_match_hashes_per_match = 9.28467, decompressed_sum = 1.212 GB, compressed_sum = 204.86 MB, factor = 6.05886, ratio = 0.165048, sum_of_self_compress = 592.76 MB size_of_data_processed = 1.212 GB/1.324 GB 91.5786% number_of_inaccessible_files = 0/517401 size_of_inaccessible = 0 B/1.324 GB READ = 1.212 GB, RE-READ = 940.70 MB, Total READ = 2.131 GB process_rate = 34.33 MB/sec
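As a quick sanity check on a line like the one above, factor is decompressed_sum divided by compressed_sum: 1.212 GB ≈ 1241 MB, and 1241 MB / 204.86 MB ≈ 6.06, matching factor = 6.05886. Likewise, gain lines up with sum_of_gain as a percentage of sum_of_self_compress: 403245081 bytes ≈ 384.6 MB, and 384.6 / 592.76 ≈ 64.9%.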

Output Explanation

 

The summary probe output is described in Understanding VAST Probe Output.

Low-Level Output

In addition to the previous output, the probe will also periodically output lower-level information. These days, that information is not typically useful, but here is an explanation just in case.

n_chunks - The number of chunks processed by the probe (default size is 32K).

avg_chunk_size - The average chunk size.

n_matched_chunks - The number of chunks identified as similar to pre-existing chunks by similarity search. 

match_percent - The percentage of chunks identified as similar to pre-existing chunks by similarity search.

sum_of_gain - The total space saved by similarity compression.

gain - The percentage of space saved by similarity compression.

avg_gain_per_match - The average amount of space saved per chunk from similarity compression.

avg_match_hashes_per_match - The average amount of matching hashes found during similarity search.

n_duplicate_chunks - The number of identical chunks found.

dedup_percent - The percentage of identical chunks found.

original_size - The amount of data processed by the probe.

compressed_sum - The estimated size of data post compression, dedup, and similarity compression.

factor - The compression factor (original_size / compressed_sum).

ratio - 1 / factor

sum_of_self_compress - The data size if only local compression (with the given chunk size) was applied.

size_of_data_processed - The progress indication.

number_of_inaccessible_files - The number of files that were found in the initial scan but that the probe could not read when processing them.

size_of_inaccessible - The amount of data that was found in the initial scan but that the probe could not read when processing it.

READ - The amount of scanned data.

RE-READ - The amount of data that was re-read in order to perform global compression.

Total READ - READ + RE-READ

 

I thought it would be helpful to share results from a test run and an interpretation of those results for the benefit of others:

 

Here’s the last line of output with a summary:

n_chunks = 1120059762, avg_chunk_size = 32754.2, n_matched_chunks = 507860164, match_percent = 45.3422% , sum_of_gain = 99.367 GB, gain = 0.603369, avg_gain_per_match = 210.087, avg_match_hashes_per_match = 3.4, n_duplicate_chunks = 529286922, dedup_percent=47.2552, original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069, ratio = 0.0812552, sum_of_self_compress = 16.0828 TB, size_of_data_processed = 33.3664 TB/33.7412 TB 98.8889%, number_of_inaccessible_files = 17233/880015, size_of_inaccessible = 384.025 GB/33.7412 TB, READ = 33.3664 TB, RE-READ = 15.1350 TB, Total READ = 48.5014 TB

 

And the definitions that I think are most pertinent:

match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search 

sum_of_gain - total space saved by similarity compression

gain - percentage of space saved by similarity compression

dedup_percent - percentage of identical chunks found

original_size - amount of data processed by the probe

compressed_sum - estimated size of data post compression, dedup and similarity compression

factor - compression factor (original_size / compressed_sum)

sum_of_self_compress - data size if only local compression (with the given chunk size) was applied

size_of_data_processed - progress indication 

This means the probe processed 33 TB of data.

The “naive” compressed size would have been 16 TB.

The actual compressed size, including compression, dedup, and similarity compression, was 2.7 TB.

Thus the total savings factor was 12 (33/2.7).

 

Digging a little deeper, we see that the majority of the savings came from dedup (47% of the chunks were identical) and compression, as it looks like similarity compression saved 0.6% for a total of 99GB.

 

Probe Analyze

After the probe completes a run, it will automatically analyze its own output from the log files and generate an analysis log (still quite long) that breaks down the data reduction achieved by directory and file extension.

 

In rare cases, you may need to run this manually. Here's how:

cd /vast/install/probe
python3 ./probe.py --analyze_log <probe log file>
.... output about processing files ....
Processed 1967860 files             
Writing probe run analysis to ..../probe_Date.log.analysis

 

I/O Behavior

  1. Speed: From a scan-speed perspective, we've found that, on average, we see ~60 MB/sec per physical CPU core when running the probe in full "similarity hash" mode (the default value for 'match_disable').  Thus, a 20-core system would net approximately 1.2 GByte/sec. That said, performance is also highly dependent on the disk latency of the target system being scanned and is often delayed by random reads on that system (a rough planning example follows this list).

  2. Read amplification: The way our similarity hasher works is that if it discovers any matches, it will need to re-read a portion of the dataset to look for additional opportunities for data reduction.  When your data is highly similar, this can result in significant read amplification.  Therefore, when determining the amount of time it will take to scan a file system, it is necessary to allow the probe to run for a period of time to determine the approximate 'Re-Read' ratio.

    1. E.g., look at the /mnt/probe/db/*.stats output to see this.

  3. match_disable=1: If you choose this setting (non-default), the probe will bypass similarity hashing and instead only look for local compression opportunities and full-chunk matches (for dedup).  This is much less CPU-intensive, and we've found that the bottleneck will typically be either networking or the filesystem that it is scanning, up to a point.  In my testing on a system with 25GigE, this mode averaged 1.3 GB/sec (about 66 MB/sec per physical core).  At times, the network throughput approached line rate (2+ GByte/sec).

    1. If you have a subset of data that is representative of a larger set, it would be advisable to run against the smaller set in this mode first, to determine the local compression & dedup rates.  Once that rate is established, running the probe again in similarity-hash mode against the full dataset is recommended.
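As a rough, hypothetical planning example based on the figures above: a 20-core probe host running in full similarity-hash mode scans at about 20 x 60 MB/sec ≈ 1.2 GB/sec, so 100 TB of data needs on the order of 100,000 GB / 1.2 GB/sec ≈ 83,000 seconds, or roughly 23 hours of pure read time. If the observed Re-Read ratio adds another 50% of read volume, budget closer to 35 hours.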