This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Neuron monitor is primary observability tool for neuron devices. For details of neuron monitor, please refer to the neuron monitor guide. This tutorial describes deploying neuron monitor as a daemonset on the kubernetes cluster.

Deploy Neuron Monitor Daemonset#

  • Download the neuron monitor yaml file. k8s-neuron-monitor-daemonset.yml

  • Apply the Neuron monitor yaml to create a daemonset on the cluster with the following command

    kubectl apply -f k8s-neuron-monitor.yml
    
  • Verify that neuron monitor daemonset is running

    kubectl get ds neuron-monitor --namespace neuron-monitor
    

    Expected result (with 2 nodes in cluster):

    NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    neuron-monitor                     2         2         2       2            2           <none>          27h
    
  • Get the neuron-monitor pod names

    kubectl get pods
    

    Expected result

    NAME                   READY   STATUS    RESTARTS   AGE
    neuron-monitor-slsxf   1/1     Running   0          17m
    neuron-monitor-wc4f5   1/1     Running   0          17m
    
  • Verify the prometheus endpoint is available

    kubectl exec neuron-monitor-wc4f5 -- wget -q --output-document - http://127.0.0.1:8000
    

    Expected result

    # HELP python_gc_objects_collected_total Objects collected during gc
    # TYPE python_gc_objects_collected_total counter
    python_gc_objects_collected_total{generation="0"} 362.0
    python_gc_objects_collected_total{generation="1"} 0.0
    python_gc_objects_collected_total{generation="2"} 0.0
    # HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
    # TYPE python_gc_objects_uncollectable_total counter
    

This document is relevant for: Inf1, Inf2, Trn1, Trn1n