This document is relevant for: Inf1
Neuron Scheduler Extension Flow Diagram
                                                              +----------------------------+
                                                              | POD Manifest               |
                                                              | with Request               |
                                                              | aws.amazon.com/neuroncore:2|
                                                              |                            |
                                                              |                            |
                                               2              +-------------+--------------+
                               +--------------------------------+           |
                               |                                |           |
                               |                                |           | 3
+------------------------------+-----+                          |           |
|           Kubelet in INF1/TRN1 Node|                          |           |
|                                    +<-----------+             |           |
+-----+---------------------+--------+            |       +-----v-----------v--------------+
      |                     ^                     |       |         Kube-Scheduler         |
      |                     |                     |       |                                |
      |                     |                     |       +--^------+---------------+------+
    9 |                   1 |                     |          |      |               |
      |                     |                    8|         5|      |4              |
      |                     |                     |          |      |               |
      |                     |                     |          |      |               |6
      v                     |                     |          |      |               |
+-----+---------------------+--------+            |       +--+------v---------------v------+
|        neuron-device-plugin        |            +-------+      neuron-scheduler-ext      |
|         in INF1/TRN1 node          |                    +---------------------+----------+
+----+----------------------+--------+                                          |
     |                      |                                                   |7
     |                      |10                                                 |
     |                      |                                                   v
   11|                      |                                         +---------+-------+
     |                      |                                         |POD Manifest:    |
     |                      |                                         |Annotation:      |
     |                      |                                         |NEURON_CORES:2,3 |
     v                      +---------------------------------------->+                 |
--device=/dev/neuron1 --env NEURON_RT_VISIBLE_CORES=2,3               |                 |
                                                                      |                 |
                                                                      +-----------------+
1. neuron-device-plugin returns the list of Neuron cores/devices to the kubelet
2. Kubelet advertises the core/device list to the K8s API server (and in turn to kube-scheduler)
3. POD request for Neuron cores/devices [kube-scheduler picks up the POD creation request]
4. kube-scheduler calls the neuron-scheduler-extn filter function with the list of nodes and the POD specification
5. neuron-scheduler-extn scans through the nodes, filters out nodes with non-contiguous cores/devices, and returns the nodes that are capable of supporting the given POD specification
6. kube-scheduler calls the neuron-scheduler-extn bind function with the POD and the node
7. neuron-scheduler-extn updates the POD annotation with the allocated Neuron core/device IDs (contiguous)
8. neuron-scheduler-extn sends the bind request to the kubelet of the selected node
9. Kubelet calls the Allocate function of the neuron-device-plugin
10. neuron-device-plugin queries the POD annotation for the allocated core/device IDs
11. neuron-device-plugin exports the devices and visible cores to the container runtime
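The filter, bind, and export steps (4-5, 6-7, and 10-11) can be illustrated with a minimal Python sketch. It is not the actual neuron-scheduler-extn or neuron-device-plugin code: the function names, the per-node list of free core IDs, and the first-fit choice of a contiguous block are all assumptions made for illustration. The annotation key NEURON_CORES is taken from the diagram, and the real annotation key may differ.

```python
from typing import Dict, List, Optional


def find_contiguous_block(free_cores: List[int], count: int) -> Optional[List[int]]:
    """Return the first run of `count` consecutive free core IDs, or None.

    First-fit is an assumption; the real extension may choose differently.
    """
    if count <= 0:
        return None
    run: List[int] = []
    for core in sorted(set(free_cores)):
        if run and core == run[-1] + 1:
            run.append(core)   # extends the current consecutive run
        else:
            run = [core]       # gap found, start a new run
        if len(run) == count:
            return run
    return None


def filter_nodes(free_by_node: Dict[str, List[int]], requested: int) -> List[str]:
    """Steps 4-5: keep only nodes that can supply a contiguous block."""
    return [
        node
        for node, free in free_by_node.items()
        if find_contiguous_block(free, requested) is not None
    ]


def bind_annotation(free_by_node: Dict[str, List[int]], node: str, requested: int) -> Dict[str, str]:
    """Steps 6-7: pick the block on the chosen node and express it as an annotation."""
    block = find_contiguous_block(free_by_node[node], requested)
    if block is None:
        raise ValueError(f"node {node!r} has no contiguous block of {requested} cores")
    # Hypothetical annotation key, copied from the diagram.
    return {"NEURON_CORES": ",".join(str(c) for c in block)}


def runtime_env(annotations: Dict[str, str]) -> Dict[str, str]:
    """Steps 10-11: turn the annotation into the environment passed to the container."""
    return {"NEURON_RT_VISIBLE_CORES": annotations["NEURON_CORES"]}


if __name__ == "__main__":
    # Hypothetical cluster state: free core IDs per node.
    free = {"node-a": [0, 2, 3], "node-b": [1, 3]}
    print(filter_nodes(free, 2))          # ['node-a']
    ann = bind_annotation(free, "node-a", 2)
    print(ann)                            # {'NEURON_CORES': '2,3'}
    print(runtime_env(ann))               # {'NEURON_RT_VISIBLE_CORES': '2,3'}
```

With a request for two cores, node-b is filtered out because its free cores 1 and 3 are not adjacent, while node-a can offer cores 2 and 3, matching the NEURON_CORES:2,3 annotation and NEURON_RT_VISIBLE_CORES=2,3 shown in the diagram.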