This document is relevant for: Inf2, Trn1, Trn2
NCCOM-TEST User Guide#
Overview#
nccom-test is a benchmarking tool for evaluating Collective Communication operations on AWS Trainium and Inferentia instances. It supports Trn1, Trn2, and Inf2 instance types. The tool can assess performance across multiple instances or perform quick environment sanity checks before running more complex workloads. While single-instance benchmarking is supported for all compatible instance types, multi-instance benchmarking is limited to Trainium instances (Trn1 and Trn2).
Note
On Inf2 instances, only single-instance benchmarking is supported. Running a multi-node nccom-test benchmark will result in an error.
Using nccom-test#
The following example runs an all-reduce across 2 workers (ranks) with a total size of 32MB:
nccom-test -r 2 allr
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
33554432 33554432 uint8 768 40.69 40.69
Avg bus bandwidth: 40.6901GB/s
Output description#
The command outputs a table of performance metrics, with one row for every requested data size. By default the only data size is 32MB, as seen in the previous example.
Column name | Description |
---|---|
size(B) | Size in bytes for the data involved in this operation |
count(elems) | Number of elements in the data involved in this operation. For example, if size(B) is 4 and type is fp32, then count will be 1 since one single fp32 element has been processed. |
type | Data type for the processed data. Can be: uint8, int8, uint16, int16, fp16, bf16, int32, uint32, fp32 |
time(us) | Time in microseconds representing the P50 of all durations for the Collective Communication operations executed during the benchmark. |
algbw(GB/s) | Algorithm bandwidth in gibibytes (1GiB = 1,073,741,824 bytes) per second which is calculated as size(B) / time(us) |
busbw(GB/s) | Bus bandwidth - bandwidth per data line in gibibytes per second - it provides a bandwidth number that is independent from the number of ranks (unlike algbw). For a more in-depth explanation on bus bandwidth, please refer to NVIDIA’s nccl-tests documentation. |
Avg bus bandwidth | Average of the values in the busbw column |
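As a rough cross-check of these definitions, the sketch below recomputes algbw and busbw for rows taken from the examples in this guide. The helper names are hypothetical. The busbw scaling factors (2(n-1)/n for all-reduce, (n-1)/n for all-gather) come from NVIDIA's nccl-tests documentation and are applied here on the assumption that nccom-test follows the same convention; the example outputs in this guide are consistent with that assumption.

```python
GIB = 1 << 30  # 1 GiB = 1,073,741,824 bytes


def algbw_gib_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: bytes per second, expressed in GiB/s."""
    return size_bytes / (time_us * 1e-6) / GIB


# Assumed scaling factors, following the nccl-tests convention.
BUSBW_FACTOR = {
    "allr": lambda n: 2 * (n - 1) / n,  # all-reduce
    "allg": lambda n: (n - 1) / n,      # all-gather
}


def busbw_gib_per_s(op: str, ranks: int, size_bytes: int, time_us: float) -> float:
    """Bus bandwidth: algorithm bandwidth scaled by the per-operation factor."""
    return algbw_gib_per_s(size_bytes, time_us) * BUSBW_FACTOR[op](ranks)


# 2-rank all-reduce row: 33554432 bytes in 768 us
print(f"{algbw_gib_per_s(33554432, 768):.2f}")               # 40.69
# 32-rank all-gather row: 524288 bytes in 179 us
print(f"{busbw_gib_per_s('allg', 32, 524288, 179):.2f}")     # 2.64
```

With 2 ranks the all-reduce factor is 1, which is why algbw and busbw are identical in the single-instance all-reduce examples.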
CLI arguments#
Argument | Default value | Description |
---|---|---|
<cc operation> | N/A, required argument | The type of Collective Communication operation to execute for this benchmark. Supported types: |
 | N/A, required argument | Total number of workers (ranks) to use |
 | 1 | Total number of nodes (instances) to use. The number of workers will be divided equally across all nodes. If this argument is greater than 1, the NEURON_RT_ROOT_COMM_ID environment variable needs to be set to the host address of the instance nccom-test is run on, and a free port number (for example: |
 | 32M | The starting size for the benchmark |
 | 32M | The end size for the benchmark. nccom-test will run benchmarks for all sizes between |
 | ( | Number of bytes by which to increase the benchmark’s size on every subsequent run. For example, for this combination of arguments: |
 | N/A | Factor by which to multiply the benchmark’s size on every subsequent run. For example, for this combination of argument values: |
 | 20 | Number of Collective Communication operations to execute during the benchmark. |
 | 5 | Number of Collective Communication operations to execute as warmup during the benchmark (which won’t be counted towards the result). |
 | | Data type for the data used by the benchmark. Supported types: |
 | false | If provided, the correctness of the operations will be checked. This will not impact results (time, algbw and busbw) but will slightly increase the overall execution time. |
 | N/A | Hosts on which to run execution. Checks |
 | false | Do not display any animation or progress indicator. |
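To illustrate how a starting size, an end size, and a step factor combine into a list of benchmark sizes, here is a minimal sketch. The function name is hypothetical and the logic is inferred from the example outputs in this guide (for instance `--minbytes 1kb --maxbytes 1mb --stepfactor 8` produces four rows), not from the tool's implementation.

```python
def sweep_sizes(min_bytes: int, max_bytes: int, step_factor: int) -> list[int]:
    """List sizes from min_bytes up to max_bytes, multiplying by step_factor each step."""
    if min_bytes < 1:
        raise ValueError("min_bytes must be positive")
    if step_factor < 2:
        raise ValueError("step_factor must be at least 2")
    sizes = []
    size = min_bytes
    while size <= max_bytes:  # sizes that would exceed max_bytes are not included
        sizes.append(size)
        size *= step_factor
    return sizes


print(sweep_sizes(1024, 1 << 20, 8))  # [1024, 8192, 65536, 524288]
```

Note that the end size itself only appears in the list when it is reachable from the starting size by repeated multiplication, as in the 1KiB to 1GiB sweep with a factor of 4.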
Note
All arguments that take a size in bytes will also accept larger size units, for example: -f 2048 can be written as -f 2kb, or -f 1048576 can be written as -f 1MB.
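The unit convention implied by this note (2kb = 2048 bytes, 1MB = 1048576 bytes) is binary. The sketch below shows one way to convert such strings to byte counts when preparing or post-processing benchmark runs. `parse_size` is a hypothetical helper, and the exact set of spellings nccom-test accepts is not specified here; the single-letter forms are an assumption based on the 32M default shown in the arguments table.

```python
import re

# Binary multipliers, matching 2kb = 2048 and 1MB = 1048576.
# Single-letter suffixes are assumed by analogy with the "32M" default.
_UNITS = {
    "": 1, "b": 1,
    "k": 1 << 10, "kb": 1 << 10,
    "m": 1 << 20, "mb": 1 << 20,
    "g": 1 << 30, "gb": 1 << 30,
}


def parse_size(text: str) -> int:
    """Convert a size such as '2kb' or '1MB' into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([a-zA-Z]*)", text.strip())
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


print(parse_size("2kb"))  # 2048
print(parse_size("1MB"))  # 1048576
```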
Examples#
Note
Performance data shown in these examples should not be considered up-to-date. For the latest performance data, please refer to the performance section.
Single Instance Examples#
Quick environment validation
nccom-test -r 2 allr
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
33554432 33554432 uint8 768 40.69 40.69
Avg bus bandwidth: 40.6901GB/s
If a problem is found, it is reported in one of two ways:
Immediately:
nccom-test -r 2 allr
Neuron DKMS Driver is not running! Read the troubleshooting guide at: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html#neuron-driver-installation-fails
After a benchmark attempt:
nccom-test -r 2 allr
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
33554432 Failure running neuron-bench - log file /tmp/nccom_test_log_7pqpdfjf.log
1 errors found - test failed
In this case, further information about the error can be found in the neuron-bench log file.
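When nccom-test is run from a script, the log file path can be pulled out of the captured output. The following is a minimal sketch that assumes the failure message keeps the wording shown in the example above; `find_failure_log` is a hypothetical helper, not part of nccom-test.

```python
import re
from typing import Optional


def find_failure_log(output: str) -> Optional[str]:
    """Return the neuron-bench log path mentioned in the output, or None if absent."""
    match = re.search(r"Failure running neuron-bench - log file (\S+)", output)
    return match.group(1) if match else None


# Sample text copied from the failure example in this guide.
sample = (
    "size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)\n"
    "33554432 Failure running neuron-bench - log file /tmp/nccom_test_log_7pqpdfjf.log\n"
    "1 errors found - test failed\n"
)
print(find_failure_log(sample))  # /tmp/nccom_test_log_7pqpdfjf.log
```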
2 rank all-reduce on a single instance for sizes ranging from 1KiB to 1GiB with a step of 4x
nccom-test -r 2 --minbytes 1kb --maxbytes 1gb --stepfactor 4 --datatype fp32 allr
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
1024 256 fp32 58 0.02 0.02
4096 1024 fp32 58 0.07 0.07
16384 4096 fp32 58 0.26 0.26
65536 16384 fp32 58 1.05 1.05
262144 65536 fp32 60 4.07 4.07
1048576 262144 fp32 68 14.36 14.36
4194304 1048576 fp32 107 36.51 36.51
16777216 4194304 fp32 332 47.06 47.06
67108864 16777216 fp32 1214 51.48 51.48
268435456 67108864 fp32 4750 52.63 52.63
1073741824 268435456 fp32 18930 52.83 52.83
Avg bus bandwidth: 23.6671GB/s
32 rank all-gather on a single instance for sizes ranging from 1KiB to 1MiB with a step of 8x, with correctness checking
nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
1024 256 fp32 151 0.01 0.01
8192 2048 fp32 149 0.05 0.05
65536 16384 fp32 150 0.41 0.39
524288 131072 fp32 179 2.73 2.64
Avg bus bandwidth: 0.7731GB/s
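For scripted analysis, the result table can be turned into structured records. The sketch below assumes the six-column layout shown in these examples; `Row` and `parse_rows` are hypothetical names, and any line that does not look like a result row (header, summary, error messages) is skipped.

```python
from typing import NamedTuple


class Row(NamedTuple):
    size_bytes: int
    count: int
    dtype: str
    time_us: float
    algbw: float
    busbw: float


def parse_rows(output: str) -> list[Row]:
    """Parse result rows from nccom-test output; other lines are ignored."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # A result row has exactly six fields and starts with the size in bytes.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        rows.append(Row(int(fields[0]), int(fields[1]), fields[2],
                        float(fields[3]), float(fields[4]), float(fields[5])))
    return rows


# Output copied from the 32 rank all-gather example.
sample = """size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
1024 256 fp32 151 0.01 0.01
8192 2048 fp32 149 0.05 0.05
65536 16384 fp32 150 0.41 0.39
524288 131072 fp32 179 2.73 2.64
Avg bus bandwidth: 0.7731GB/s"""

rows = parse_rows(sample)
best = max(rows, key=lambda r: r.busbw)
print(len(rows), best.size_bytes)  # 4 524288
```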
Multiple Instances Example#
64 rank all-reduce on two instances for sizes ranging from 8 bytes to 1GiB with a step of 2x, running 50 ops
NEURON_RT_ROOT_COMM_ID=10.1.4.145:45654 nccom-test -r 64 -N 2 -b 8 -e 1GB -f 2 -n 50 -w 5 -d fp32 allr --hosts 127.0.0.1 10.1.4.138
size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s)
8 2 fp32 520 0.00 0.00
16 4 fp32 520 0.00 0.00
32 8 fp32 523 0.00 0.00
64 16 fp32 525 0.00 0.00
128 32 fp32 553 0.00 0.00
256 64 fp32 709 0.00 0.00
512 128 fp32 782 0.00 0.00
1024 256 fp32 840 0.00 0.00
2048 512 fp32 881 0.00 0.00
4096 1024 fp32 916 0.00 0.01
8192 2048 fp32 1013 0.01 0.01
16384 4096 fp32 1031 0.01 0.03
32768 8192 fp32 1174 0.03 0.05
65536 16384 fp32 1315 0.05 0.09
131072 32768 fp32 1315 0.09 0.18
262144 65536 fp32 1311 0.19 0.37
524288 131072 fp32 1312 0.37 0.73
1048576 262144 fp32 1328 0.74 1.45
2097152 524288 fp32 1329 1.47 2.89
4194304 1048576 fp32 1378 2.83 5.58
8388608 2097152 fp32 1419 5.51 10.84
16777216 4194304 fp32 2138 7.31 14.39
33554432 8388608 fp32 2711 11.53 22.69
67108864 16777216 fp32 3963 15.77 31.05
134217728 33554432 fp32 6279 19.91 39.19
268435456 67108864 fp32 11954 20.91 41.17
536870912 134217728 fp32 21803 22.93 45.15
1073741824 268435456 fp32 41806 23.92 47.09
Avg bus bandwidth: 9.3924GB/s
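A multi-instance run needs NEURON_RT_ROOT_COMM_ID set to a host address and a free port. The sketch below shows one possible way to build that value from a launcher script. Both helper names are hypothetical, the host address is simply the one used in the example above, and asking the operating system for an unused port is an assumption about how a free port might be chosen.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def root_comm_env(host: str, port: int) -> dict:
    """Copy the current environment and add NEURON_RT_ROOT_COMM_ID=host:port."""
    env = dict(os.environ)
    env["NEURON_RT_ROOT_COMM_ID"] = f"{host}:{port}"
    return env


env = root_comm_env("10.1.4.145", find_free_port())
print(env["NEURON_RT_ROOT_COMM_ID"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching nccom-test. The port is only known to be free at the moment of the check, so another process could claim it before the benchmark starts.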