.. _tensorflow-yolo4:

Working with YOLO v4 using AWS Neuron SDK
=========================================

The :ref:`/src/examples/tensorflow/yolo_v4_demo/evaluate.ipynb` notebook
contains an example of how to take an open-source YOLO v4 model and run it on
AWS Inferentia.

Optimizing image pre-processing and post-processing for object detection models
--------------------------------------------------------------------------------

End-to-end object detection pipelines usually contain image pre- and
post-processing operators that cannot run efficiently on Inferentia;
``DecodeJPEG`` and ``NonMaxSuppression`` are typical examples. In practice, we
may simply place these operators on CPU using the AWS Neuron machine learning
framework integration. However, Inferentia is such a high-performance machine
learning accelerator that, once the model successfully compiles and runs,
these simple pre- and post-processing operators can become the new performance
bottleneck! In this tutorial, we explain some commonly used tensorflow
techniques for optimizing the performance of these pre- and post-processing
operators so that we can fully unleash the potential of Inferentia.

1. Write JPEG decoding and image shifting/scaling as tensorflow operators.

   In ``yolo_v4_coco_saved_model.py``, you may find the following code
   snippet.

   .. code:: python

      import tensorflow as tf
      from functools import partial  # used by the preprocessor below
      ...

      def YOLOv4(...):
          ...
          x, image_shape = layers.Lambda(lambda t: preprocessor(t, input_shape))(inputs)

          # cspdarknet53
          x = conv2d_unit(x, i32, 3, strides=1, padding='same')
          ...

      def decode_jpeg_resize(input_tensor, image_size):
          # Decode raw image bytes into an RGB tensor, then resize and
          # rescale it into a float16 tensor with values in [0, 1]
          tensor = tf.image.decode_png(input_tensor, channels=3)
          shape = tf.shape(tensor)
          tensor = tf.cast(tensor, tf.float32)
          tensor = tf.image.resize(tensor, image_size)
          tensor /= 255.0
          return tf.cast(tensor, tf.float16), shape

      def preprocessor(input_tensor, image_size):
          with tf.name_scope('Preprocessor'):
              # Decode and resize the images in the batch in parallel on CPU
              tensor = tf.map_fn(
                  partial(decode_jpeg_resize, image_size=image_size),
                  input_tensor, dtype=(tf.float16, tf.int32),
                  back_prop=False, parallel_iterations=16)
          return tensor

   Compared with the implementation in the original repo, our difference is
   the use of ``tf.image.decode_png`` and ``tf.image.resize``, along with a
   small number of scaling/casting operators. After this modification, the
   generated tensorflow SavedModel takes raw JPEG image bytes as input,
   instead of a float32 array representing the image. At a 608x608 image
   resolution, this technique effectively reduces the input size from 4.4 MB
   (608 x 608 pixels x 3 channels x 4 bytes per float32) to the size of a
   typical JPEG image, which can be as little as a few hundred KB. When the
   tensorflow SavedModel is deployed through
   `tensorflow/serving <https://github.com/tensorflow/serving>`__, this
   technique can very effectively reduce the gRPC transfer overhead of input
   images.

2. Replace non-max suppression (NMS) operations by
   ``tf.image.combined_non_max_suppression``.

   Another difference in our implementation is the treatment of non-max
   suppression, a commonly used operation for removing redundant bounding
   boxes that overlap with other boxes. In an object detection scenario
   represented by the COCO dataset, where the number of output classes is
   large, the hand-fused
   `tf.image.combined_non_max_suppression <https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression>`__
   operator can parallelize multi-class NMS on CPU in a very efficient
   manner. With proper use of this operator, the bounding box post-processing
   step has less chance of becoming the performance bottleneck in the
   end-to-end object detection pipeline.
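Before walking through the full post-processing code, here is a minimal,
self-contained sketch of how ``tf.image.combined_non_max_suppression`` is
called. The shapes, scores, and thresholds below are made-up values for
illustration and are not part of the tutorial code.

.. code:: python

   import tensorflow as tf

   # Toy inputs (made-up values): one image, four candidate boxes shared
   # across three classes (the "q" dimension of the boxes tensor is 1).
   boxes = tf.random.uniform([1, 4, 1, 4])   # [batch, num_boxes, q, 4]
   scores = tf.random.uniform([1, 4, 3])     # [batch, num_boxes, num_classes]

   nms_boxes, nms_scores, nms_classes, valid_detections = \
       tf.image.combined_non_max_suppression(
           boxes, scores,
           max_output_size_per_class=2,  # per-class cap, analogous to nms_top_k
           max_total_size=4,
           iou_threshold=0.45,           # analogous to nms_thresh
           score_threshold=0.1)          # analogous to conf_thresh

   # nms_boxes:        [1, 4, 4] selected boxes, zero-padded to max_total_size
   # nms_scores:       [1, 4]    scores of the selected boxes
   # nms_classes:      [1, 4]    class indices of the selected boxes
   # valid_detections: [1]       number of valid (non-padded) detections

Because the operator fuses the per-class suppression and the final merge into
a single kernel, it avoids running a separate NMS operation per class.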
The following sample code (from ``yolo_v4_coco_saved_model.py``) demonstrates
our method of writing the bounding box post-processing step using efficient
tensorflow operations.

.. code:: python

   ...

   def filter_boxes(outputs):
       boxes_l, boxes_m, boxes_s, box_scores_l, box_scores_m, box_scores_s, image_shape = outputs
       # Discard low-confidence boxes at each of the three output scales
       boxes_l, box_scores_l = filter_boxes_one_size(boxes_l, box_scores_l)
       boxes_m, box_scores_m = filter_boxes_one_size(boxes_m, box_scores_m)
       boxes_s, box_scores_s = filter_boxes_one_size(boxes_s, box_scores_s)
       boxes = tf.concat([boxes_l, boxes_m, boxes_s], axis=0)
       box_scores = tf.concat([box_scores_l, box_scores_m, box_scores_s], axis=0)
       # Rescale normalized box coordinates back to the original image size;
       # image_shape[1::-1] reverses (height, width) into (width, height)
       image_shape_wh = image_shape[1::-1]
       image_shape_whwh = tf.concat([image_shape_wh, image_shape_wh], axis=-1)
       image_shape_whwh = tf.cast(image_shape_whwh, tf.float32)
       boxes *= image_shape_whwh
       # Reshape to [batch, num_boxes, q, 4] boxes and
       # [batch, num_boxes, num_classes] scores for combined NMS
       boxes = tf.expand_dims(boxes, 0)
       box_scores = tf.expand_dims(box_scores, 0)
       boxes = tf.expand_dims(boxes, 2)
       nms_boxes, nms_scores, nms_classes, valid_detections = tf.image.combined_non_max_suppression(
           boxes, box_scores,
           max_output_size_per_class=nms_top_k, max_total_size=nms_top_k,
           iou_threshold=nms_thresh, score_threshold=conf_thresh,
           pad_per_class=False, clip_boxes=False,
           name='CombinedNonMaxSuppression',
       )
       return nms_boxes[0], nms_scores[0], nms_classes[0]

   def filter_boxes_one_size(boxes, box_scores):
       # Keep only boxes whose best class score exceeds the confidence threshold
       box_class_scores = tf.reduce_max(box_scores, axis=-1)
       keep = box_class_scores > conf_thresh
       boxes = boxes[keep]
       box_scores = box_scores[keep]
       return boxes, box_scores

   def batch_yolo_out(outputs):
       with tf.name_scope('yolo_out'):
           b_output_lr, b_output_mr, b_output_sr, b_image_shape = outputs
           with tf.name_scope('process_feats'):
               b_boxes_l, b_box_scores_l = batch_process_feats(b_output_lr, anchors, masks[0])
           with tf.name_scope('process_feats'):
               b_boxes_m, b_box_scores_m = batch_process_feats(b_output_mr, anchors, masks[1])
           with tf.name_scope('process_feats'):
               b_boxes_s, b_box_scores_s = batch_process_feats(b_output_sr, anchors, masks[2])
           with tf.name_scope('filter_boxes'):
               # Run filtering and NMS per image, in parallel on CPU
               b_nms_boxes, b_nms_scores, b_nms_classes = tf.map_fn(
                   filter_boxes,
                   [b_boxes_l, b_boxes_m, b_boxes_s,
                    b_box_scores_l, b_box_scores_m, b_box_scores_s, b_image_shape],
                   dtype=(tf.float32, tf.float32, tf.float32),
                   back_prop=False, parallel_iterations=16)
       return b_nms_boxes, b_nms_scores, b_nms_classes

   boxes_scores_classes = layers.Lambda(batch_yolo_out)([output_lr, output_mr, output_sr, image_shape])
   ...

For other advanced data input/output pipeline optimization techniques, please
refer to https://www.tensorflow.org/guide/data#preprocessing_data.
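For example, a ``tf.data`` input pipeline can feed raw JPEG bytes to the
SavedModel described above while overlapping file I/O with inference. The
sketch below is illustrative only; the file pattern, batch size, and use of
``AUTOTUNE`` are assumptions, not part of the tutorial code.

.. code:: python

   import tensorflow as tf

   # Hypothetical input pipeline; the file pattern and batch size are
   # illustrative assumptions, not part of yolo_v4_coco_saved_model.py.
   filenames = tf.data.Dataset.list_files('val2017/*.jpg')

   dataset = (
       filenames
       # tf.io.read_file returns raw JPEG bytes; the decoding itself
       # happens inside the SavedModel (see decode_jpeg_resize above).
       .map(tf.io.read_file, num_parallel_calls=tf.data.experimental.AUTOTUNE)
       .batch(4)
       # Overlap host-side file reads with accelerator execution.
       .prefetch(tf.data.experimental.AUTOTUNE)
   )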