This document is relevant for: Inf2, Trn1, Trn2

NCCOM-TEST User Guide#

Overview #

nccom-test is a benchmarking tool for evaluating Collective Communication operations on AWS Trainium and Inferentia instances. It supports Trn1, Trn2, and Inf2 instance types. The tool can assess performance across multiple instances or perform quick environment sanity checks before running more complex workloads. While single-instance benchmarking is supported for all compatible instance types, multi-instance benchmarking is limited to Trainium instances (Trn1 and Trn2).

Note

On Inf2 instances, only single-instance benchmarking is supported. Running a multi-node nccom-test benchmark will result in an error.

Using nccom-test #

The following example runs an all-reduce across 2 workers (ranks) with the default data size of 32MB:

nccom-test -r 2 allr
     size(B)    count(elems)     type    time(us)    algbw(GB/s)    busbw(GB/s)
    33554432        33554432    uint8         768          40.69          40.69
Avg bus bandwidth:      40.6901GB/s

Output description #

The command prints a table of performance metrics, with one row for every requested data size (by default the data size is 32MB, as seen in the previous example).

Column name	Description
size(B)	Size in bytes for the data involved in this operation
count(elems)	Number of elements in the data involved in this operation. For example, if size(B) is 4 and type is fp32, then count will be 1 since one single fp32 element has been processed.
type	Data type for the processed data. Can be: uint8, int8, uint16, int16, fp16, bf16, int32, uint32, fp32
time(us)	Time in microseconds representing the P50 of all durations for the Collective Communication operations executed during the benchmark.
algbw(GB/s)	Algorithm bandwidth in gibibytes (1GiB = 1,073,741,824 bytes) per second, calculated as size(B) / time(us) and then converted from bytes per microsecond to gibibytes per second.
busbw(GB/s)	Bus bandwidth - bandwidth per data line in gibibytes per second - it provides a bandwidth number that is independent of the number of ranks (unlike algbw). For a more in-depth explanation of bus bandwidth, please refer to NVIDIA’s nccl-tests documentation.
Avg bus bandwidth	Average of the values in the busbw column
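
The relationship between the columns can be checked with a few lines of Python. This is a minimal sketch, not the tool's own implementation: the function names are hypothetical, and the bus bandwidth scaling factors are the ones given in NVIDIA's nccl-tests documentation, assumed to apply here because they agree with the sample outputs in this guide.

```python
GIB = 1 << 30  # 1 GiB = 1,073,741,824 bytes


def algbw_gib_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: bytes per second, expressed in GiB/s."""
    seconds = time_us * 1e-6
    return size_bytes / seconds / GIB


def busbw_gib_per_s(op: str, nranks: int, size_bytes: int, time_us: float) -> float:
    """Bus bandwidth, using the nccl-tests scaling factors (an assumption)."""
    if op == "all_reduce":
        factor = 2 * (nranks - 1) / nranks
    elif op in ("all_gather", "reduce_scatter"):
        factor = (nranks - 1) / nranks
    else:
        raise ValueError(f"no scaling factor defined in this sketch for {op!r}")
    return algbw_gib_per_s(size_bytes, time_us) * factor


# 2-rank all-reduce of 32 MiB in 768 us: the factor is 1, so both values match.
print(f"{algbw_gib_per_s(33554432, 768):.2f}")
print(f"{busbw_gib_per_s('all_reduce', 2, 33554432, 768):.2f}")
```

With 2 ranks the all-reduce factor is 1, which is why algbw and busbw are identical in the two-rank examples, while they diverge in runs with more ranks.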

CLI arguments #

Argument	Default value	Description
<cc operation>	N/A, required argument	The type of Collective Communication operation to execute for this benchmark. Supported types: `all_reduce` / `allr`: All-Reduce `all_gather` / `allg`: All-Gather `reduce_scatter` / `redsct`: Reduce-Scatter `sendrecv`: Send-Receive `alltoall`: All-to-All
`-r, --nworkers`	N/A, required argument	Total number of workers (ranks) to use
`-N, --nnodes`	1	Total number of nodes (instances) to use. The number of workers will be divided equally across all nodes. If this argument is greater than 1, the NEURON_RT_ROOT_COMM_ID environment variable needs to be set to the host address of the instance nccom-test is run on, and a free port number (for example: `NEURON_RT_ROOT_COMM_ID=10.0.0.1:44444`). Additionally, either `-s, --hosts` needs to be provided or a `~/hosts` file needs to exist - for more details refer to the `-s, --hosts` description below.
`-b, --minbytes`	32M	The starting size for the benchmark
`-e, --maxbytes`	32M	The end size for the benchmark. nccom-test will run benchmarks for all sizes between `-b, --minbytes` and `-e, --maxbytes`, increasing the size by either `-i, --stepbytes` or `-f, --stepfactor` with every run.
`-i, --stepbytes`	(`--maxbytes` - `--minbytes`) / 10	Number of bytes by which to increase the benchmark’s size on every subsequent run. For example, for this combination of arguments: `-b 8 -e 16 -i 4`, the benchmark will be run for the following sizes: 8 bytes, 12 bytes, 16 bytes.
`-f, --stepfactor`	N/A	Factor by which to multiply the benchmark’s size on every subsequent run. For example, for this combination of argument values: `-b 8 -e 32 -f 2`, the benchmark will be run for the following sizes: 8 bytes, 16 bytes, 32 bytes.
`-n, --iters`	20	Number of Collective Communication operations to execute during the benchmark.
`-w, --warmup_iters`	5	Number of Collective Communication operations to execute as warmup during the benchmark (which won’t be counted towards the result).
`-d, --datatype`	`uint8`	Data type for the data used by the benchmark. Supported types: `uint8`, `int8`, `uint16`, `int16`, `fp16`, `bf16`, `uint32`, `int32`, `fp32`. Input data will be zero filled, unless `--check` is provided (currently, only available for `--datatype fp32`) in which case it will be filled with a repeated value of the requested type.
`-c, --check`	false	If provided, the correctness of the operations will be checked. This will not impact results (time, algbw and busbw) but will slightly increase the overall execution time.
`-s, --hosts`	N/A	Hosts on which to run execution. Checks `~/hosts` if not specified.
`--non-interactive`	false	Do not display any animation or progress indicator.
`--report-to-json-file`	N/A	Persist config and results to JSON file if a filepath is provided.
`--show-input-output-size`	false	Print or save to JSON per rank input and output sizes in B.
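
The way `--minbytes`, `--maxbytes`, `--stepbytes` and `--stepfactor` combine into a list of benchmark sizes can be sketched as follows. The function name `sweep_sizes` is hypothetical, and details such as integer division for the default step and stopping once the next size would exceed the maximum are assumptions inferred from the examples in this guide, not a description of the tool's source.

```python
from typing import List, Optional


def sweep_sizes(min_bytes: int, max_bytes: int,
                step_bytes: Optional[int] = None,
                step_factor: Optional[int] = None) -> List[int]:
    """Return the sizes a sweep from min_bytes to max_bytes would visit."""
    if step_bytes is not None and step_factor is not None:
        raise ValueError("use either step_bytes or step_factor, not both")
    if min_bytes <= 0 or max_bytes < min_bytes:
        raise ValueError("expected 0 < min_bytes <= max_bytes")

    if step_factor is not None:
        if step_factor <= 1:
            raise ValueError("step_factor must be greater than 1")
        sizes = []
        size = min_bytes
        while size <= max_bytes:  # stop once the next size exceeds the maximum
            sizes.append(size)
            size *= step_factor
        return sizes

    if step_bytes is None:
        # Documented default: (maxbytes - minbytes) / 10 (integer division assumed).
        step_bytes = (max_bytes - min_bytes) // 10
    if step_bytes <= 0:
        return [min_bytes]  # e.g. the default 32M..32M run has a single size
    return list(range(min_bytes, max_bytes + 1, step_bytes))


print(sweep_sizes(8, 16, step_bytes=4))   # sizes for -b 8 -e 16 -i 4
print(sweep_sizes(8, 32, step_factor=2))  # sizes for -b 8 -e 32 -f 2
```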

Note

All arguments that take a size in bytes will also accept larger size units. For example, -b 2048 can be written as -b 2kb, and -b 1048576 can be written as -b 1MB.
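
The size suffixes in the note above are binary multiples (2kb is 2048 bytes, 1MB is 1048576 bytes). A minimal sketch of such a conversion is shown here; `parse_size` is a hypothetical helper, and the set of accepted suffixes and the case-insensitive matching are assumptions based only on the spellings that appear in this guide (`kb`, `MB`, `gb`, `32M`).

```python
import re

# Binary multiples; the single-letter forms are assumed from the "32M" default.
_UNITS = {
    "": 1, "b": 1,
    "k": 1 << 10, "kb": 1 << 10,
    "m": 1 << 20, "mb": 1 << 20,
    "g": 1 << 30, "gb": 1 << 30,
}


def parse_size(text: str) -> int:
    """Convert strings such as '2kb' or '1MB' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([a-zA-Z]*)", text.strip())
    if match is None:
        raise ValueError(f"not a size: {text!r}")
    number, unit = match.group(1), match.group(2).lower()
    if unit not in _UNITS:
        raise ValueError(f"unknown size unit: {unit!r}")
    return int(number) * _UNITS[unit]


print(parse_size("2kb"), parse_size("1MB"), parse_size("1gb"))
```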

Examples #

Note

Performance data shown in these examples should not be considered up-to-date. For the latest performance data, please refer to the performance section.

Single Instance Examples#

Quick environment validation

nccom-test -r 2 allr
    size(B)    count(elems)     type    time(us)    algbw(GB/s)    busbw(GB/s)
    33554432        33554432    uint8         768          40.69          40.69
Avg bus bandwidth:      40.6901GB/s

If a problem is found, it is reported in one of two ways:

Immediately:

nccom-test -r 2 allr
Neuron DKMS Driver is not running! Read the troubleshooting guide at: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html#neuron-driver-installation-fails

After a benchmark attempt:

nccom-test -r 2 allr
     size(B)    count(elems)    type    time(us)    algbw(GB/s)    busbw(GB/s)
    33554432    Failure running neuron-bench - log file /tmp/nccom_test_log_7pqpdfjf.log
1 errors found - test failed

In this case, further information about the error can be found in the neuron-bench log file.

2 rank all-reduce on a single instance for sizes ranging from 1KiB to 1GiB with a step of 4x

nccom-test -r 2 --minbytes 1kb --maxbytes 1gb --stepfactor 4 --datatype fp32 allr
       size(B)    count(elems)    type    time(us)    algbw(GB/s)    busbw(GB/s)
          1024             256    fp32          58           0.02           0.02
          4096            1024    fp32          58           0.07           0.07
         16384            4096    fp32          58           0.26           0.26
         65536           16384    fp32          58           1.05           1.05
        262144           65536    fp32          60           4.07           4.07
       1048576          262144    fp32          68          14.36          14.36
       4194304         1048576    fp32         107          36.51          36.51
      16777216         4194304    fp32         332          47.06          47.06
      67108864        16777216    fp32        1214          51.48          51.48
     268435456        67108864    fp32        4750          52.63          52.63
    1073741824       268435456    fp32       18930          52.83          52.83
Avg bus bandwidth:      23.6671GB/s

32 rank all-gather on a single instance for sizes ranging from 1KiB to 1MiB with a step of 8x, with correctness checking

nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg
size(B)    count(elems)    type    time(us)    algbw(GB/s)    busbw(GB/s)
   1024             256    fp32         151           0.01           0.01
   8192            2048    fp32         149           0.05           0.05
  65536           16384    fp32         150           0.41           0.39
 524288          131072    fp32         179           2.73           2.64
Avg bus bandwidth:      0.7731GB/s

Reporting input and output size explicitly with --show-input-output-size

nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg --show-input-output-size
size(B)    count(elems)    total_input_size(B)    total_output_size(B)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
   1024             256                     32                    1024    fp32            6.16           0.17           0.16
   8192            2048                    256                    8192    fp32            6.48           1.26           1.23
  65536           16384                   2048                   65536    fp32            8.17           8.02           7.77
 524288          131072                  16384                  524288    fp32           23.16          22.64          21.93
Avg bus bandwidth:      7.7715GB/s

Example results as JSON with --report-to-json-file

nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg --report-to-json-file nccom-results.json
size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
   1024             256    fp32            6.19           0.17           0.16
   8192            2048    fp32            6.55           1.25           1.21
  65536           16384    fp32            8.18           8.01           7.76
 524288          131072    fp32           23.11          22.69          21.98
Avg bus bandwidth:      7.7775GB/s

python3 -m json.tool nccom-results.json
{
    "results": [
        {
            "size(B)": 1024,
            "count(elems)": 256,
            "type": "fp32",
            "algbw(GB/s)": 0.16553675170497603,
            "busbw(GB/s)": 0.16036372821419553,
            "time:avg(us)": 6.19
        },
        {
            "size(B)": 8192,
            "count(elems)": 2048,
            "type": "fp32",
            "algbw(GB/s)": 1.2500906056270864,
            "busbw(GB/s)": 1.21102527420124,
            "time:avg(us)": 6.55
        },
        {
            "size(B)": 65536,
            "count(elems)": 16384,
            "type": "fp32",
            "algbw(GB/s)": 8.008982241741455,
            "busbw(GB/s)": 7.758701546687035,
            "time:avg(us)": 8.18
        },
        {
            "size(B)": 524288,
            "count(elems)": 131072,
            "type": "fp32",
            "algbw(GB/s)": 22.688776793562784,
            "busbw(GB/s)": 21.97975251876395,
            "time:avg(us)": 23.11
        }
    ]
}
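
A report written with `--report-to-json-file` can be post-processed with standard tooling. The sketch below assumes the report has the structure shown above (a top-level `results` list whose entries carry a `busbw(GB/s)` key) and recomputes the average bus bandwidth; `average_busbw` is a hypothetical helper name, and the inline sample reuses the busbw values from the example report.

```python
import json


def average_busbw(report: dict) -> float:
    """Average of the busbw(GB/s) values across all result rows."""
    rows = report["results"]
    return sum(row["busbw(GB/s)"] for row in rows) / len(rows)


# Inline sample with the busbw values from the report above. For a real run:
#     with open("nccom-results.json") as f:
#         report = json.load(f)
sample = json.loads("""
{
    "results": [
        {"size(B)": 1024,   "busbw(GB/s)": 0.16036372821419553},
        {"size(B)": 8192,   "busbw(GB/s)": 1.21102527420124},
        {"size(B)": 65536,  "busbw(GB/s)": 7.758701546687035},
        {"size(B)": 524288, "busbw(GB/s)": 21.97975251876395}
    ]
}
""")

print(f"Avg bus bandwidth: {average_busbw(sample):.4f}GB/s")
```

For the sample values this reproduces the 7.7775GB/s figure printed at the end of the benchmark table.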

Multiple Instances Example#

64 rank all-reduce on two instances for sizes ranging from 8 bytes to 1GiB with a step of 2x, running 50 ops

NEURON_RT_ROOT_COMM_ID=10.1.4.145:45654 nccom-test -r 64 -N 2 -b 8 -e 1GB -f 2 -n 50 -w 5 -d fp32 allr --hosts 127.0.0.1 10.1.4.138
       size(B)    count(elems)    type    time(us)    algbw(GB/s)    busbw(GB/s)
             8               2    fp32         520           0.00           0.00
            16               4    fp32         520           0.00           0.00
            32               8    fp32         523           0.00           0.00
            64              16    fp32         525           0.00           0.00
           128              32    fp32         553           0.00           0.00
           256              64    fp32         709           0.00           0.00
           512             128    fp32         782           0.00           0.00
          1024             256    fp32         840           0.00           0.00
          2048             512    fp32         881           0.00           0.00
          4096            1024    fp32         916           0.00           0.01
          8192            2048    fp32        1013           0.01           0.01
         16384            4096    fp32        1031           0.01           0.03
         32768            8192    fp32        1174           0.03           0.05
         65536           16384    fp32        1315           0.05           0.09
        131072           32768    fp32        1315           0.09           0.18
        262144           65536    fp32        1311           0.19           0.37
        524288          131072    fp32        1312           0.37           0.73
       1048576          262144    fp32        1328           0.74           1.45
       2097152          524288    fp32        1329           1.47           2.89
       4194304         1048576    fp32        1378           2.83           5.58
       8388608         2097152    fp32        1419           5.51          10.84
      16777216         4194304    fp32        2138           7.31          14.39
      33554432         8388608    fp32        2711          11.53          22.69
      67108864        16777216    fp32        3963          15.77          31.05
     134217728        33554432    fp32        6279          19.91          39.19
     268435456        67108864    fp32       11954          20.91          41.17
     536870912       134217728    fp32       21803          22.93          45.15
    1073741824       268435456    fp32       41806          23.92          47.09
Avg bus bandwidth:      9.3924GB/s
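
When multi-instance runs are launched from a script, it can help to assemble the environment variable and arguments in one place. The following is a sketch under stated assumptions: `build_multinode_command` is a hypothetical helper, the divisibility check reflects the statement that workers are divided equally across nodes, and only flags documented in this guide are used. It builds the command without running it.

```python
from typing import Dict, List, Sequence, Tuple


def build_multinode_command(op: str, nworkers: int, hosts: Sequence[str],
                            root_addr: str, root_port: int,
                            extra_args: Sequence[str] = ()
                            ) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for a multi-node nccom-test run."""
    nnodes = len(hosts)
    if nnodes < 2:
        raise ValueError("a multi-node run needs at least two hosts")
    if nworkers % nnodes != 0:
        # Workers are divided equally across nodes, so require divisibility.
        raise ValueError("nworkers must be a multiple of the number of hosts")
    env = {"NEURON_RT_ROOT_COMM_ID": f"{root_addr}:{root_port}"}
    argv = ["nccom-test", "-r", str(nworkers), "-N", str(nnodes),
            *extra_args, op, "--hosts", *hosts]
    return env, argv


env, argv = build_multinode_command(
    "allr", 64, ["127.0.0.1", "10.1.4.138"], "10.1.4.145", 45654,
    extra_args=["-b", "8", "-e", "1GB", "-f", "2", "-n", "50", "-w", "5", "-d", "fp32"],
)
print(env, " ".join(argv))
# To launch: subprocess.run(argv, env={**os.environ, **env}, check=True)
```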
