.. _neuron-collectives-rn:

Neuron Collectives Release Notes
================================

Neuron Collectives refers to a set of libraries used to support collective compute operations within the Neuron SDK. The collectives support is delivered via the aws-neuronx-collectives package and includes a pre-built version of the OFI plugin required for use of collectives with Elastic Fabric Adapter (EFA).

.. contents:: Table of contents
   :local:
   :depth: 1

Neuron Collectives [2.20.22.0]
------------------------------
Date: 04/01/2024

New in this release:

* minor bug fixes and enhancements

Neuron Collectives [2.20.11.0]
------------------------------
Date: 02/13/2024

Bug Fixes
^^^^^^^^^

* Require “libatomic” for rpm installs

Neuron Collectives [2.19.7.0]
------------------------------
Date: 12/21/2023

New in this release
^^^^^^^^^^^^^^^^^^^

* Improve collectives barrier latency from 500us to 40us

Bug Fixes
^^^^^^^^^

* Fix bug where proxy thread blocks the runtime from adding ops, leading to an execution hang

Neuron Collectives [2.18.18.0]
------------------------------
Date: 10/26/2023

New in this release:

* Bumped compatibility version to 17 to align with struct change in the nec.h header

Neuron Collectives [2.17.9.0]
------------------------------
Date: 9/14/2023

New in this release:

* minor bug fixes and enhancements

Neuron Collectives [2.16.16.0]
------------------------------
Date: 9/01/2023

New in this release:

* minor bug fixes and enhancements

Neuron Collectives [2.16.8.0]
------------------------------
Date: 8/28/2023

New in this release:

* Improved error messages for unsupported topologies
* Improved timeout error messages for bootstrapInit

Bug Fixes:

* Fix bug where the Linux kernel version check for the SAFE_FORK env variable was incorrectly requiring SAFE_FORK to be set on kernel versions greater than 5

Neuron Collectives [2.15.16.0]
------------------------------
Date: 8/09/2023

New in this release:

* minor bug fixes and enhancements

Neuron Collectives [2.15.13.0]
------------------------------
Date: 7/19/2023

New in this release:

* AllReduce with All-to-all communication pattern enabled for 16 ranks on TRN1/TRN1N within the instance (intranode); choice of 16 ranks is limited to NeuronCores 0-15 or 16-31.

Bug Fixes:

* Fix incorrect mask calculation for 16 ranks when using NeuronCores 16-31
* Fix channels for 16 ranks to avoid failures in the runtime; restrict participating ranks to 0-15 or 16-31

Neuron Collectives [2.14.9.0]
------------------------------
Date: 6/14/2023

New in this release

* Added check for FI_EFA_FORK_SAFE environment variable; now forcing the flag to be set to 1 for multinode runs executing on Linux kernels older than 5.15.

Neuron Collectives [2.13.7.0]
------------------------------
Date: 05/01/2023

New in this release

* Added support for dma_buf, which is required for future EFA and Linux kernel updates.
* Reduced benign reporting of timeouts. Previous implementations reported “Timeout waiting for incoming connection” too frequently (log spam).

Neuron Collectives [2.12.35.0]
------------------------------
Date: 04/19/2023

Bug Fixes

* Fixed support for SOCKET_IFNAME config that was affecting EKS users at scale on large training jobs.

Neuron Collectives [2.12.22.0]
------------------------------
Date: 03/28/2023

New in this release

* Added support for TRN1N.
* Added support for 16 channels and 16 EFA devices, which is required for enabling EC2 TRN1N instances with Neuron.
* Added support for hierarchical All-Reduce and Reduce-Scatter. These implementations are now used by default and provide up to 75% reduction in latency for 2MB buffers across 256 ranks.

Neuron Collectives [2.11.47.0]
------------------------------
Date: 02/08/2023

New in this release

* Added support for Inf2.

Neuron Collectives [2.10.20.0]
------------------------------
Date: 10/10/2022

New in this release

* Improved logging to appear similar in style to Neuron Runtime

Bug Fixes

* Fixed memory registration to support 2GB+ sizes
* Fixed association of network devices to channels (removes previous hard-coding).

Neuron Collectives [2.9.86.0]
-----------------------------
Date: 10/10/2022

New in this release

* Added support for All-Reduce, Reduce-Scatter, All-Gather, and Send/Recv operations.
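
For readers unfamiliar with the collective operations named above, the following is a minimal sketch of their usual semantics, simulated in plain Python on in-memory lists. It does not use or represent the Neuron Collectives API: the function names ``all_reduce``, ``reduce_scatter``, and ``all_gather`` are hypothetical helpers written for this illustration, and a sum reduction is assumed because the notes do not specify one. Send/Recv is a point-to-point transfer and is omitted.

```python
from typing import List

# One input buffer per rank; index in the outer list is the rank number.
Buffers = List[List[int]]


def _elementwise_sum(buffers: Buffers) -> List[int]:
    """Sum the buffers position by position (assumed reduction: sum)."""
    return [sum(values) for values in zip(*buffers)]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends up with the full element-wise sum."""
    total = _elementwise_sum(buffers)
    return [list(total) for _ in buffers]


def reduce_scatter(buffers: Buffers) -> Buffers:
    """Rank r ends up with only the r-th equal-sized chunk of the sum."""
    num_ranks = len(buffers)
    total = _elementwise_sum(buffers)
    if len(total) % num_ranks != 0:
        raise ValueError("buffer length must be divisible by the rank count")
    chunk = len(total) // num_ranks
    return [total[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends up with all ranks' buffers concatenated in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


# Two simulated ranks, four elements each.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
reduced = all_reduce(ranks)
scattered = reduce_scatter(ranks)
regathered = all_gather(scattered)
```

In this sketch, running Reduce-Scatter and then All-Gather yields the same result as a single All-Reduce, which is a common way to decompose the operation. How the library actually schedules these operations across NeuronCores and EFA devices is not described in these notes.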