This document is relevant for: Inf1

Working with YOLO v4 using AWS Neuron SDK#

The Evaluate YOLO v4 on Inferentia notebook contains an example on how to take an open source YOLO v4 models, and run it on AWS Inferentia.

Optimizing image pre-processing and post-processing for object detection models#

End-to-end object detection pipelines usually contain image pre-post-processing operators that cannot run efficiently on Inferentia. DecodeJPEG and NonMaxSuppression are typical examples. In practice, we may simply place these operators on CPU using the AWS Neuron machine learning framework integration. However, Inferentia is such a high performance machine learning accelerator that, once the model successfully compiles and runs, these simple pre-post-processing operators can become the new performance bottleneck! In this tutorial, we explain some commonly used tensorflow techniques for optimizing the performance of these pre-post-processing operators so that we can fully unleash the potential of Inferentia.

  1. Write JPEG decoding and image shifting/scaling as tensorflow operators.

In, you may find the following code snippet.

import tensorflow as tf

def YOLOv4(...
    x, image_shape = layers.Lambda(lambda t: preprocessor(t, input_shape))(inputs)

    # cspdarknet53
    x = conv2d_unit(x, i32, 3, strides=1, padding='same')

def decode_jpeg_resize(input_tensor, image_size):
    tensor = tf.image.decode_png(input_tensor, channels=3)
    shape = tf.shape(tensor)
    tensor = tf.cast(tensor, tf.float32)
    tensor = tf.image.resize(tensor, image_size)
    tensor /= 255.0
    return tf.cast(tensor, tf.float16), shape

def preprocessor(input_tensor, image_size):
    with tf.name_scope('Preprocessor'):
        tensor = tf.map_fn(
            partial(decode_jpeg_resize, image_size=image_size), input_tensor,
            dtype=(tf.float16, tf.int32), back_prop=False, parallel_iterations=16)
    return tensor

Comparing with the implementation in the original repo, our difference is the use of tf.image.decode_png and tf.image.resize, along with a small number of scaling/casting operators. After this modification, the generated tensorflow SavedModel now takes JPEG image raw bytes as input, instead of a float32 array representing the image. When the image resolution is 608x608, this technique effectively reduces the input image size from 4.4 MB to the size of a typical JPEG image, which can be as little as hundreds of KB. When the tensorflow SavedModel is deployed through tensorflow/serving, this technique can very effectively reduce the gRPC transfer overhead of input images.

  1. Replace non-max suppression (NMS) operations by tf.image.combined_non_max_suppression.

Another difference of our implementation is the treatment of non-max suppression, a commmonly used operation for removing redundant bounding boxes that overlap with other boxes. In an object detection scenario represented by the COCO dataset where the number of output classes is large, the hand-fused `tf.image.combined_non_max_suppression <>`__ operator can parallelize multi-class NMS on CPU in a very efficient manner. With proper use of this operator, the bounding box post-processing step has a less chance of becoming the performance bottleneck in the end-to-end object detection pipeline.

The following sample code (from demonstrates our method of writing the bounding box post-processing step using efficient tensorflow operations.

    def filter_boxes(outputs):
        boxes_l, boxes_m, boxes_s, box_scores_l, box_scores_m, box_scores_s, image_shape = outputs
        boxes_l, box_scores_l = filter_boxes_one_size(boxes_l, box_scores_l)
        boxes_m, box_scores_m = filter_boxes_one_size(boxes_m, box_scores_m)
        boxes_s, box_scores_s = filter_boxes_one_size(boxes_s, box_scores_s)
        boxes = tf.concat([boxes_l, boxes_m, boxes_s], axis=0)
        box_scores = tf.concat([box_scores_l, box_scores_m, box_scores_s], axis=0)
        image_shape_wh = image_shape[1::-1]
        image_shape_whwh = tf.concat([image_shape_wh, image_shape_wh], axis=-1)
        image_shape_whwh = tf.cast(image_shape_whwh, tf.float32)
        boxes *= image_shape_whwh
        boxes = tf.expand_dims(boxes, 0)
        box_scores = tf.expand_dims(box_scores, 0)
        boxes = tf.expand_dims(boxes, 2)
        nms_boxes, nms_scores, nms_classes, valid_detections = tf.image.combined_non_max_suppression(
        return nms_boxes[0], nms_scores[0], nms_classes[0]

    def filter_boxes_one_size(boxes, box_scores):
        box_class_scores = tf.reduce_max(box_scores, axis=-1)
        keep = box_class_scores > conf_thresh
        boxes = boxes[keep]
        box_scores = box_scores[keep]
        return boxes, box_scores

    def batch_yolo_out(outputs):
        with tf.name_scope('yolo_out'):
            b_output_lr, b_output_mr, b_output_sr, b_image_shape = outputs
            with tf.name_scope('process_feats'):
                b_boxes_l, b_box_scores_l = batch_process_feats(b_output_lr, anchors, masks[0])
            with tf.name_scope('process_feats'):
                b_boxes_m, b_box_scores_m = batch_process_feats(b_output_mr, anchors, masks[1])
            with tf.name_scope('process_feats'):
                b_boxes_s, b_box_scores_s = batch_process_feats(b_output_sr, anchors, masks[2])
            with tf.name_scope('filter_boxes'):
                b_nms_boxes, b_nms_scores, b_nms_classes = tf.map_fn(
                    filter_boxes, [b_boxes_l, b_boxes_m, b_boxes_s, b_box_scores_l, b_box_scores_m, b_box_scores_s, b_image_shape],
                    dtype=(tf.float32, tf.float32, tf.float32), back_prop=False, parallel_iterations=16)
        return b_nms_boxes, b_nms_scores, b_nms_classes

    boxes_scores_classes = layers.Lambda(batch_yolo_out)([output_lr, output_mr, output_sr, image_shape])

For other advanced data input/output pipeline optimization techniques, please refer to

This document is relevant for: Inf1