Introduction to the YOLO v3 Algorithm

Image from //towardsdatascience.com/yolo-v3-object-detection-with-keras-461d2cfccef6

Data Preprocessing

Input image dimensions: (416, 416, 3)

Input image annotations: $[(x_1, y_1, x_2, y_2, class{\_}index), (x_1, y_1, x_2, y_2, class{\_}index), \ldots, (x_1, y_1, x_2, y_2, class{\_}index)]$ denotes all ground-truth boxes annotated in the image, where $class{\_}index$ is the class of the corresponding box, $(x_1, y_1)$ is the coordinate of the box's top-left corner, and $(x_2, y_2)$ is the coordinate of its bottom-right corner.
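For concreteness, a hypothetical annotation array for an image with two objects might look as follows (the coordinate values are made up; the class indices follow the 80-class COCO convention assumed throughout this post):

import numpy as np

# one row per ground-truth box: (x1, y1, x2, y2, class_index)
bboxes = np.array([[ 48, 240, 195, 371, 16],   # class 16 = dog in COCO
                   [120,  30, 370, 404,  0]])  # class 0 = person in COCO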

YOLO v3 uses 9 anchor boxes in total, 3 per detector. The anchor boxes are determined by clustering all ground-truth boxes in the training set, with the clustering distance defined through IoU (the larger the IoU, the smaller the distance):$$d(\text {box}, \text {centroid})=1-\operatorname{IoU}(\text {box}, \text {centroid})$$where IoU (Intersection over Union) is the area of the intersection of two boxes divided by the area of their union.
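The original post does not include the clustering code; a minimal k-means sketch under this $1-\operatorname{IoU}$ distance might look as follows (the boxes argument, a float array of (width, height) pairs, and the random initialization are assumptions for illustration):

import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (w, h) pairs, with all boxes aligned at the same top-left corner
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, None, 0] * boxes[:, None, 1] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union  # (num_boxes, k)

def kmeans_anchors(boxes, k=9, iterations=100):
    # initialize the k centroids with randomly chosen boxes
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iterations):
        # assign each box to the nearest centroid under d = 1 - IoU
        assignments = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        # move each non-empty cluster's centroid to the median (w, h) of its members
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = np.median(boxes[assignments == i], axis=0)
    return centroids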

The most important preprocessing step is converting the image annotations into the model's output format. First determine which anchor box each ground-truth box corresponds to (the anchor box with the largest IoU with it), then write the box's information at that anchor box's position.

########## The following code is for illustration only; performance and code structure are not considered ##########
import numpy as np

train_output_sizes = [52, 26, 13]
label = [np.zeros((train_output_sizes[i], train_output_sizes[i], 3, 85)) for i in range(3)]
bboxes_count = np.zeros((3,), dtype=np.int32) # integer dtype so it can be used as an index below
max_bbox_per_scale = 150 # maximum number of ground-truth boxes per detector
bboxes_xywh = [np.zeros((max_bbox_per_scale, 4)) for _ in range(3)]
# width and height of the 9 default anchor boxes in YOLO v3
anchors = [[(10,13), (16,30), (33,23)], [(30,61), (62,45), (59,119)], [(116,90), (156,198), (373,326)]]
# bboxes: all ground-truth boxes annotated in one image
for bbox in bboxes:
    bbox_coor = bbox[:4]
    bbox_class_ind = bbox[4]
    # one-hot encode the class
    onehot = np.zeros(80, dtype=np.float32)
    onehot[bbox_class_ind] = 1.0
    # convert box coordinates from (x1, y1, x2, y2) to (xc, yc, width, height)
    bbox_xywh = np.concatenate([(bbox_coor[2:] + bbox_coor[:2]) * 0.5, bbox_coor[2:] - bbox_coor[:2]], axis=-1)
    # find the anchor box with the largest IoU with this box
    # (boxes are aligned at the top-left corner, so the intersection is min(w) * min(h))
    iou = []
    for anchors_detector in anchors:
        for anchor in anchors_detector:
            intersection = min(bbox_xywh[2], anchor[0]) * min(bbox_xywh[3], anchor[1])
            box_area = bbox_xywh[2] * bbox_xywh[3]
            anchor_area = anchor[0] * anchor[1]
            iou.append(intersection / (box_area + anchor_area - intersection))
    anchor_idx = np.argmax(np.array(iou))
    # map anchor_idx to the corresponding output position
    best_detect = int(anchor_idx // 3) # which detector (0: 52x52, 1: 26x26, 2: 13x13)
    best_anchor = int(anchor_idx % 3)  # which anchor within that detector
    scale = int(416 / train_output_sizes[best_detect])
    xind, yind = int(bbox_xywh[0] / scale), int(bbox_xywh[1] / scale)
    label[best_detect][yind, xind, best_anchor, 0:4] = bbox_xywh
    label[best_detect][yind, xind, best_anchor, 4:5] = 1.0
    label[best_detect][yind, xind, best_anchor, 5:] = onehot
    # store the box information
    bboxes_xywh[best_detect][bboxes_count[best_detect], :4] = bbox_xywh
    bboxes_count[best_detect] += 1
label_sbbox, label_mbbox, label_lbbox = label
sbboxes, mbboxes, lbboxes = bboxes_xywh
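After this loop, label_sbbox, label_mbbox, and label_lbbox have shapes (52, 52, 3, 85), (26, 26, 3, 85), and (13, 13, 3, 85) respectively, the last dimension holding the (xc, yc, w, h) box, the objectness flag, and the 80-dimensional one-hot class vector; sbboxes, mbboxes, and lbboxes each hold up to max_bbox_per_scale rows of (xc, yc, w, h).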


Model Architecture

The YOLO v3 architecture is built in two parts: the first part uses the Darknet backbone to construct 52×52, 26×26, and 13×13 feature maps, and the second part builds detectors on top of these three feature maps, as shown in the figure below:

Image from //towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e (with the errors in the original figure corrected)

### convolutional and residual blocks
# assumes a TensorFlow 2.x / Keras environment (the original post does not show its imports)
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization, LeakyReLU,
                                     ZeroPadding2D, UpSampling2D, Add, Concatenate)
from tensorflow.keras.models import Model

def _conv_block(inp, convs, skip=True):
    x = inp
    count = 0
    for conv in convs:
        # remember the activation two layers before the end as the skip connection
        if count == (len(convs) - 2) and skip:
            skip_connection = x
        count += 1
        if conv['stride'] > 1: x = ZeroPadding2D(((1,0),(1,0)))(x) # left and top padding
        x = Conv2D(conv['filter'],
                   conv['kernel'],
                   strides=conv['stride'],
                   padding='valid' if conv['stride'] > 1 else 'same', 
                   name='conv_' + str(conv['layer_idx']),
                   use_bias=False if conv['bnorm'] else True)(x)
        if conv['bnorm']: x = BatchNormalization(epsilon=0.001, name='bnorm_' + str(conv['layer_idx']))(x)
        if conv['leaky']: x = LeakyReLU(alpha=0.1, name='leaky_' + str(conv['layer_idx']))(x)
    return Add()([skip_connection, x]) if skip else x

### backbone
def make_yolov3_model():
    input_image = Input(shape=(None, None, 3)) #(416, 416,3)
    ###### Part 1 ###### 
    # (208, 208, 64)
    x = _conv_block(input_image, [{'filter': 32, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 0},
                                  {'filter': 64, 'kernel': 3, 'stride': 2, 'bnorm': True, 'leaky': True, 'layer_idx': 1},
                                  {'filter': 32, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 2},
                                  {'filter': 64, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 3}])
    # (104, 104, 128)
    x = _conv_block(x, [{'filter': 128, 'kernel': 3, 'stride': 2, 'bnorm': True, 'leaky': True, 'layer_idx': 5},
                        {'filter':  64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 6},
                        {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 7}])
    # (104, 104, 128)
    x = _conv_block(x, [{'filter':  64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 9},
                        {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 10}])
    # (52, 52, 256)
    x = _conv_block(x, [{'filter': 256, 'kernel': 3, 'stride': 2, 'bnorm': True, 'leaky': True, 'layer_idx': 12},
                        {'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 13},
                        {'filter': 256, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 14}])
    # (52, 52, 256)
    for i in range(7):
        x = _conv_block(x, [{'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 16+i*3},
                            {'filter': 256, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 17+i*3}])
    skip_36 = x #52x52 feature map
    # (26, 26, 512)
    x = _conv_block(x, [{'filter': 512, 'kernel': 3, 'stride': 2, 'bnorm': True, 'leaky': True, 'layer_idx': 37},
                        {'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 38},
                        {'filter': 512, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 39}])
    # (26, 26, 512)
    for i in range(7):
        x = _conv_block(x, [{'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 41+i*3},
                            {'filter': 512, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 42+i*3}])
    skip_61 = x #26x26 feature map
    # (13, 13, 1024)
    x = _conv_block(x, [{'filter': 1024, 'kernel': 3, 'stride': 2, 'bnorm': True, 'leaky': True, 'layer_idx': 62},
                        {'filter':  512, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 63},
                        {'filter': 1024, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 64}])
    # (13, 13, 1024)
    for i in range(3):
        x = _conv_block(x, [{'filter':  512, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 66+i*3},
                            {'filter': 1024, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 67+i*3}]) #13x13 feature map
    ###### Part 2 ######
    # (13, 13, 512)
    x = _conv_block(x, [{'filter':  512, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 75},
                        {'filter': 1024, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 76},
                        {'filter':  512, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 77},
                        {'filter': 1024, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 78},
                        {'filter':  512, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 79}], skip=False)
    # (13, 13, 255)
    yolo_82 = _conv_block(x, [{'filter': 1024, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 80},
                              {'filter':  255, 'kernel': 1, 'stride': 1, 'bnorm': False, 'leaky': False, 'layer_idx': 81}], skip=False) #13x13 detector
    # concatenate with 26x26 feature map, (26, 26, 256+512)
    x = _conv_block(x, [{'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 84}], skip=False)
    x = UpSampling2D(2)(x)
    x = Concatenate()([x, skip_61])
    # (26, 26, 256)
    x = _conv_block(x, [{'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 87},
                        {'filter': 512, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 88},
                        {'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 89},
                        {'filter': 512, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 90},
                        {'filter': 256, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 91}], skip=False)
    # (26, 26, 255)
    yolo_94 = _conv_block(x, [{'filter': 512, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 92},
                              {'filter': 255, 'kernel': 1, 'stride': 1, 'bnorm': False, 'leaky': False, 'layer_idx': 93}], skip=False) #26x26 detector
    # concatenate with 52x52 feature map, (52, 52, 128+256)
    x = _conv_block(x, [{'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True,   'layer_idx': 96}], skip=False)
    x = UpSampling2D(2)(x)
    x = Concatenate()([x, skip_36])
    # (52, 52, 255)
    yolo_106 = _conv_block(x, [{'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 99},
                               {'filter': 256, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 100},
                               {'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 101},
                               {'filter': 256, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 102},
                               {'filter': 128, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 103},
                               {'filter': 256, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 104},
                               {'filter': 255, 'kernel': 1, 'stride': 1, 'bnorm': False, 'leaky': False, 'layer_idx': 105}], skip=False) #52x52 detector
    model = Model(input_image, [yolo_82, yolo_94, yolo_106])
    return model
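A quick sanity check of the three detector outputs (a minimal sketch, assuming the Keras imports above and a TensorFlow 2 environment):

import numpy as np

model = make_yolov3_model()
outputs = model.predict(np.zeros((1, 416, 416, 3), dtype=np.float32))
for out in outputs:
    print(out.shape)  # (1, 13, 13, 255), (1, 26, 26, 255), (1, 52, 52, 255)

Each output's last dimension is 255 = 3 anchor boxes × (4 box coordinates + 1 objectness score + 80 class scores).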


Loss Function

There are many variants of the YOLO v3 loss function; here a fairly classic one is presented. The loss decomposes into the sum of three terms: the box loss, the objectness loss, and the classification loss, each introduced in turn below.

Box loss: the original paper uses MSE (mean squared error) as the box loss, but MSE sometimes cannot distinguish predictions of different quality. IoU better reflects the quality of a regressed box and is scale-invariant, but it only describes how much two boxes overlap, not how they overlap; moreover, if two boxes do not intersect at all, the IoU is 0 and yields no gradient for further optimization. GIoU (Generalized IoU) retains the advantages of IoU while alleviating these problems to a large extent:$$GIoU=IoU-\frac{|C \backslash(B_1 \cup B_2)|}{|C|}$$where $C$ is the smallest enclosing shape containing both $B_1$ and $B_2$. The box loss can then be written as $1-GIoU$. Below, the 13×13 detector is used as an example; the total box loss is the sum of the losses over the three detectors.
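As a quick numeric check (an illustrative example, not from the original post): let $B_1$ span $(0,0)$ to $(2,2)$ and $B_2$ span $(1,1)$ to $(3,3)$. Each has area 4, they overlap in a unit square, and the smallest enclosing box $C$ spans $(0,0)$ to $(3,3)$ with area 9, so$$IoU=\frac{1}{4+4-1}=\frac{1}{7}, \qquad GIoU=\frac{1}{7}-\frac{9-7}{9}=-\frac{5}{63} \approx -0.079,$$giving a box loss of $1-GIoU \approx 1.079$.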

To compute the box loss, the network output must first be transformed. Suppose the network outputs box information $(t_x,t_y,t_w,t_h)$, where $(t_x,t_y)$ encodes the box center and $(t_w,t_h)$ the box width and height. The transformation is:$$b_x=\sigma(t_x)+c_x;\text{  }b_y=\sigma(t_y)+c_y;\text{  }b_w=p_w e^{t_w};\text{  }b_h=p_h e^{t_h}$$where $\sigma$ is the sigmoid function, $(c_x,c_y)$ is the top-left corner of the grid cell containing $(t_x,t_y)$, and $(p_w,p_h)$ are the width and height of the corresponding anchor box.

import tensorflow as tf
import numpy as np

output_size = 13
anchors = np.array([[116,90], [156,198], [373,326]], dtype=np.float32) # anchor boxes of the 13x13 detector, see the data-preprocessing code
# yolo_82_batch: 13x13 detector output, (batch_size, 13, 13, 255); see the model-architecture code for yolo_82
conv_output = tf.reshape(yolo_82_batch, (batch_size, output_size, output_size, 3, 85))  # (batch_size, 13, 13, 3, 85)
t_xy, t_wh, objectness, classes = tf.split(conv_output, (2, 2, 1, 80), axis=-1) # t_xy: (batch_size, 13, 13, 3, 2); t_wh: (batch_size, 13, 13, 3, 2)
c_xy = tf.meshgrid(tf.range(output_size), tf.range(output_size)) # a list of two (13, 13) tensors
c_xy = tf.stack(c_xy, axis=-1) # (13, 13, 2)
c_xy = tf.tile(c_xy[tf.newaxis, :, :, tf.newaxis, :], [batch_size, 1, 1, 3, 1]) # (batch_size, 13, 13, 3, 2)
c_xy = tf.cast(c_xy, tf.float32) # cast the integer grid indices to float before adding them to the sigmoid output
scale = int(416 / output_size)
b_xy = (tf.sigmoid(t_xy) + c_xy) * scale # (batch_size, 13, 13, 3, 2)
b_wh = tf.exp(t_wh) * anchors # (batch_size, 13, 13, 3, 2)
b_xywh = tf.concat([b_xy, b_wh], axis=-1) # (batch_size, 13, 13, 3, 4)


Next, the GIoU between the predicted boxes and the ground-truth boxes is computed, which yields the box loss:

def bbox_giou(boxes1, boxes2):
    # transform from (xc, yc, w, h) to (xmin, ymin, xmax, ymax)
    boxes1 = tf.concat([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                        boxes1[..., :2] + boxes1[..., 2:] * 0.5], axis=-1)
    boxes2 = tf.concat([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                        boxes2[..., :2] + boxes2[..., 2:] * 0.5], axis=-1)
    # two box areas
    boxes1_area = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    boxes2_area = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])
    # intersection area
    left_up = tf.maximum(boxes1[..., :2], boxes2[..., :2])
    right_down = tf.minimum(boxes1[..., 2:], boxes2[..., 2:])
    inter_section = tf.maximum(right_down - left_up, 0.0)
    inter_area = inter_section[..., 0] * inter_section[..., 1]
    # compute iou
    union_area = boxes1_area + boxes2_area - inter_area
    iou = inter_area / union_area
    # enclosed area
    enclose_left_up = tf.minimum(boxes1[..., :2], boxes2[..., :2])
    enclose_right_down = tf.maximum(boxes1[..., 2:], boxes2[..., 2:])
    enclose = tf.maximum(enclose_right_down - enclose_left_up, 0.0)
    enclose_area = enclose[..., 0] * enclose[..., 1]
    # compute giou
    giou = iou - 1.0 * (enclose_area - union_area) / enclose_area
    return giou

### label_lbbox_batch: ground truth boxes in 13x13 detector, (batch_size, 13, 13, 3, 85); see the data-preprocessing code for label_lbbox
label_xywh    = label_lbbox_batch[:, :, :, :, 0:4] # ground-truth box (xc, yc, w, h)
respond_bbox  = label_lbbox_batch[:, :, :, :, 4:5] # whether the corresponding anchor box contains a ground-truth object: 1 means the box loss is computed, 0 means it is ignored
giou = tf.expand_dims(bbox_giou(b_xywh, label_xywh), axis=-1) # (batch_size, 13, 13, 3, 1)
input_size = tf.cast(416, tf.float32)
bbox_loss_scale = 2.0 - 1.0 * label_xywh[:, :, :, :, 2:3] * label_xywh[:, :, :, :, 3:4] / (input_size ** 2) # weight of the box loss: the larger the ground-truth box, the more tolerant of errors and thus the smaller the weight
giou_loss = respond_bbox * bbox_loss_scale * (1 - giou) # giou loss, (batch_size, 13, 13, 3, 1)


Objectness loss: again taking the 13×13 detector as the example, the objectness loss is essentially an imbalanced binary classification problem, since among the detector's 13x13x3 anchor boxes the number containing a ground-truth object (positive samples) is generally far smaller than the number containing none (negative samples). Focal Loss handles this by assigning larger weights to hard samples and smaller weights to easy ones:$$F L(p)=\left\{\begin{aligned}-(1-p)^{\gamma} \log (p), & \text { positive samples } \\ -p^{\gamma} \log (1-p), & \text { negative samples }\end{aligned}\right.$$Focal Loss also has a second form that additionally introduces a class weight $\alpha$:$$F L(p)=\left\{\begin{aligned}-\alpha(1-p)^{\gamma} \log (p), & \text { positive samples} \\ -(1-\alpha) p^{\gamma} \log (1-p), & \text { negative samples }\end{aligned}\right.$$This post uses the first form with $\gamma$ set to 2. The definition of negative samples is also modified slightly: if an anchor box contains no ground-truth object but its predicted box has a large IoU with some ground-truth box on the detector, it is not treated as a negative sample and is ignored in the loss computation, which further reduces the number of negative samples.

### lbboxes_batch: (xc, yc, w, h) of all ground-truth boxes on the 13x13 detector, (batch_size, max_bbox_per_scale, 4); see the data-preprocessing code for lbboxes
### label_lbbox_batch: ground truth boxes in 13x13 detector, (batch_size, 13, 13, 3, 85); see the data-preprocessing code for label_lbbox
### objectness: predicted objectness logits, (batch_size, 13, 13, 3, 1); see the output-transformation code in the box-loss part
### b_xywh: predicted box information, (batch_size, 13, 13, 3, 4); see the output-transformation code in the box-loss part
respond_bbox = label_lbbox_batch[:, :, :, :, 4:5] # whether the corresponding anchor box contains a ground-truth object: 1 for positive samples, 0 for negative samples
### reduce the number of negative samples entering the loss ###
### 1. compute the IoU between the predicted boxes and all ground-truth boxes ###
boxes1 = tf.tile(lbboxes_batch[:, tf.newaxis, tf.newaxis, tf.newaxis, :, :], [1, 13, 13, 3, 1, 1]) #(batch_size, 13, 13, 3, max_bbox_per_scale, 4)
boxes2 = tf.tile(b_xywh[:, :, :, :, tf.newaxis, :], [1, 1, 1, 1, max_bbox_per_scale, 1]) #(batch_size, 13, 13, 3, max_bbox_per_scale, 4)
boxes1_area = boxes1[..., 2] * boxes1[..., 3]
boxes2_area = boxes2[..., 2] * boxes2[..., 3]
# (xc, yc, w, h)->(xmin, ymin, xmax, ymax)
boxes1 = tf.concat([boxes1[..., :2] - boxes1[..., 2:] * 0.5, boxes1[..., :2] + boxes1[..., 2:] * 0.5], axis=-1)
boxes2 = tf.concat([boxes2[..., :2] - boxes2[..., 2:] * 0.5, boxes2[..., :2] + boxes2[..., 2:] * 0.5], axis=-1)
# compute IoU
left_up = tf.maximum(boxes1[..., :2], boxes2[..., :2])
right_down = tf.minimum(boxes1[..., 2:], boxes2[..., 2:])
inter_section = tf.maximum(right_down - left_up, 0.0)
inter_area = inter_section[..., 0] * inter_section[..., 1]
union_area = boxes1_area + boxes2_area - inter_area
iou = 1.0 * inter_area / union_area #(batch_size, 13, 13, 3, max_bbox_per_scale)
### 2. find the largest IoU; if it exceeds the given threshold, ignore the sample in the loss computation ###
max_iou = tf.expand_dims(tf.reduce_max(iou, axis=-1), axis=-1) #(batch_size, 13, 13, 3, 1)
IOU_LOSS_THRESH = 0.5
respond_bgd = (1.0 - respond_bbox) * tf.cast( max_iou < IOU_LOSS_THRESH, tf.float32) #(batch_size, 13, 13, 3, 1)
###########################
pred_conf  = tf.sigmoid(objectness) # predicted probability that a real object is present
conf_focal = tf.pow(respond_bbox - pred_conf, 2) # gamma = 2
focal_loss_p = conf_focal * respond_bbox * tf.nn.sigmoid_cross_entropy_with_logits(labels=respond_bbox, logits=objectness) # positive-sample loss
focal_loss_n = conf_focal * respond_bgd * tf.nn.sigmoid_cross_entropy_with_logits(labels=respond_bbox, logits=objectness) # negative-sample loss
focal_loss = focal_loss_p + focal_loss_n #(batch_size, 13, 13, 3, 1)  


Classification loss: again using the 13×13 detector as the example, cross-entropy loss is used. Notably, YOLO v3 uses sigmoid rather than the earlier softmax as the activation for class prediction, mainly because classes are not necessarily mutually exclusive: one object may belong to several classes at once.

### label_lbbox_batch: ground truth boxes in 13x13 detector, (batch_size, 13, 13, 3, 85); see the data-preprocessing code for label_lbbox
### classes: predicted class logits, (batch_size, 13, 13, 3, 80); see the output-transformation code in the box-loss part
respond_bbox  = label_lbbox_batch[:, :, :, :, 4:5] # whether the corresponding anchor box contains a ground-truth object: 1 means the classification loss is computed, 0 means it is ignored
labels_onehot = label_lbbox_batch[:, :, :, :, 5:] # ground-truth class of the object
classes_prob  = tf.sigmoid(classes) # predicted probability of each class
ce_loss = respond_bbox * tf.nn.sigmoid_cross_entropy_with_logits(labels=labels_onehot, logits=classes) # cross-entropy loss, (batch_size, 13, 13, 3, 80)


Combining the three losses above gives the total loss on the 13×13 detector. The losses on the other two detectors (26×26, 52×52) are computed in the same way, and the overall loss is the sum over the three detectors:

giou_loss_13 = tf.reduce_mean(tf.reduce_sum(giou_loss, axis=[1,2,3,4]))
focal_loss_13 = tf.reduce_mean(tf.reduce_sum(focal_loss, axis=[1,2,3,4]))
ce_loss_13 = tf.reduce_mean(tf.reduce_sum(ce_loss, axis=[1,2,3,4]))
total_loss_13 = giou_loss_13 + focal_loss_13 + ce_loss_13
# total loss (total_loss_26 and total_loss_52 are computed analogously on the 26x26 and 52x52 detectors)
total_loss = total_loss_13 + total_loss_26 + total_loss_52


Model Prediction

As described in the loss-function section, the network output is first transformed:

### still using the 13x13 detector as the example; the input image to be predicted has dimensions (1, 416, 416, 3)
### see the loss-function code for b_xywh, pred_conf, classes_prob
output_13 = tf.concat([b_xywh, pred_conf, classes_prob], axis=-1) # (batch_size, 13, 13, 3, 85), where batch_size is 1 here
### output_26 (26x26 detector) and output_52 (52x52 detector) are computed in the same way
### output_26: (1, 26, 26, 3, 85); output_52: (1, 52, 52, 3, 85)
pred_bbox = [tf.reshape(x, (-1, tf.shape(x)[-1])) for x in (output_13, output_26, output_52)] # [(13*13*3, 85), (26*26*3, 85), (52*52*3, 85)]
pred_bbox = tf.concat(pred_bbox, axis=0) # information of all predicted boxes, (13*13*3+26*26*3+52*52*3, 85)


Next, predicted boxes with low scores are discarded. A box's score is its objectness probability multiplied by its largest class probability:

score_threshold = 0.5
pred_bbox = pred_bbox.numpy() # convert to a NumPy array for the post-processing below (assumes TF2 eager execution)
pred_xywh = pred_bbox[:, 0:4]
# (xc, yc, w, h) --> (xmin, ymin, xmax, ymax) for computing IoU
pred_coor = np.concatenate([pred_xywh[:, :2] - pred_xywh[:, 2:] * 0.5, pred_xywh[:, :2] + pred_xywh[:, 2:] * 0.5], axis=-1)
# compute box scores
pred_conf = pred_bbox[:, 4]
pred_prob = pred_bbox[:, 5:]
classes = np.argmax(pred_prob, axis=-1) # the highest-probability class predicted for each box
scores = pred_conf * np.max(pred_prob, axis=-1)
# discard boxes with low scores
mask = scores > score_threshold
coors, scores, classes = pred_coor[mask], scores[mask], classes[mask]
filter_boxes = np.concatenate([coors, scores[:, np.newaxis], classes[:, np.newaxis]], axis=-1) # (number of remaining boxes, 6)


Non-Maximum Suppression (NMS) is then applied to the remaining predicted boxes. Its main purpose is to remove boxes that predict the same class but overlap heavily:

iou_threshold = 0.5
classes_in_img = list(set(filter_boxes[:, 5])) # all classes predicted in the image
best_bboxes = [] # the boxes that finally survive
for cls in classes_in_img:
    cls_mask = (filter_boxes[:, 5] == cls)
    cls_bboxes = filter_boxes[cls_mask] # all boxes predicted as the same class
    while len(cls_bboxes) > 0:
        max_ind = np.argmax(cls_bboxes[:, 4])
        best_bbox = cls_bboxes[max_ind] # the highest-scoring remaining box
        best_bboxes.append(best_bbox)
        ### compute the IoU between the highest-scoring box and the remaining boxes ###
        cls_bboxes = np.concatenate([cls_bboxes[:max_ind], cls_bboxes[max_ind + 1:]], axis=0) # remaining boxes (excluding the highest-scoring one)
        best_bbox_xy = best_bbox[np.newaxis, :4]
        cls_bboxes_xy = cls_bboxes[:, :4]
        ### IoU
        best_bbox_area = (best_bbox_xy[:, 2] - best_bbox_xy[:, 0]) * (best_bbox_xy[:, 3] - best_bbox_xy[:, 1])
        cls_bboxes_area = (cls_bboxes_xy[:, 2] - cls_bboxes_xy[:, 0]) * (cls_bboxes_xy[:, 3] - cls_bboxes_xy[:, 1])
        left_up = np.maximum(best_bbox_xy[:, :2], cls_bboxes_xy[:, :2])
        right_down = np.minimum(best_bbox_xy[:, 2:], cls_bboxes_xy[:, 2:])
        inter_section = np.maximum(right_down - left_up, 0.0)
        inter_area = inter_section[:, 0] * inter_section[:, 1]
        union_area = cls_bboxes_area + best_bbox_area - inter_area
        ious = 1.0 * inter_area / union_area
        ### remove boxes with a large IoU with the highest-scoring box ###
        iou_mask = ious < iou_threshold
        cls_bboxes = cls_bboxes[iou_mask]
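Finally, the surviving boxes can be stacked into a single array, one row per detection in the format (xmin, ymin, xmax, ymax, score, class_index):

best_bboxes = np.array(best_bboxes) # (number of final detections, 6)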

