Dissecting the SSD Object Detection Code (A Very Long, Very Detailed Post)

  • December 4, 2019
  • Notes

Preface

An earlier post already introduced the SSD algorithm, and I think the theory was explained reasonably clearly there, but an algorithm is never truly understood without going into the code. This article therefore builds on that SSD theory post with a source-code walkthrough; the theory post is at https://mp.weixin.qq.com/s/lXqobT45S1wz-evc7KO5DA. The source we dissect today comes from a very popular PyTorch implementation on GitHub with 3K+ stars: https://github.com/amdegroot/ssd.pytorch/

Network Architecture

To read the code against SSD's structure more comfortably, let's first lay out the SSD network architecture, shown in the figure below:

As you can see, the original SSD network uses VGG-16 as its backbone. To make it easier to see exactly what SSD changes relative to VGG16, a Zhihu post drew a very clear diagram, which I borrow here; the original is at https://zhuanlan.zhihu.com/p/79854543. The clearer side-by-side comparison of the backbone and VGG16, annotated with feature map dimensions, is shown below:

Source Code Walkthrough

OK, now let's start dissecting SSD from the source. Sort out three things, the construction of the network, the anchors, and the loss function, and you can consider the code understood.

Building the Network

From the figure above we can clearly see that, with VGG16 as the backbone, SSD drops VGG16's fully connected layers after conv5 and replaces them with a 3x3 convolution (conv6) and a 1x1 convolution (conv7). The max-pooling layer in front of conv4_1 has ceil_mode=True, which turns the 75x75 input into a 38x38 output feature map. Also, the max-pooling layer after conv5_3 uses kernel_size=3, stride=1, padding=1, so it performs no downsampling. Attaching the other four convolutional stages used for multi-scale extraction after fc7 then completes the SSD network. The code for the modified VGG16, from ssd.py, is as follows:

def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers

You can see this matches the figure above exactly. The conv7 obtained at the end of the code is the fc7 in our figure, with feature dimensions 19x19x1024. Now we can build the multi-scale extraction network that follows, i.e. the Extra Feature Layers in the architecture diagram. Here is that part cropped from the opening figure, for comparison against the code.

The implementation is as follows (also from ssd.py):

def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False  # flag toggles kernel_size between 1 and 3
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                           kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers

Besides the modified VGG16 and the Extra Layers, the architecture diagram also shows six horizontal lines. These represent the convolutions applied to the six feature map scales to produce the box regression (loc) and class (cls) predictions. Note that SSD treats background as a class too, so for the VOC dataset the number of classes is 20+1=21. The code for this part is:

def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []   # regression heads for the multi-scale branches
    conf_layers = []  # classification heads for the multi-scale branches
    # Part 1: taps on the VGG body, Conv2d-4_3 (layer 21) and Conv2d-7_1 (layer -2)
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        # regression: boxes * 4 (coordinates)
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        # confidence: boxes * num_classes
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                        cfg[k] * num_classes, kernel_size=3, padding=1)]
    # Part 2: box counts come from cfg starting at the third entry; the layers
    # used for multi-scale extraction are the 1st, 3rd, 5th, 7th extra layers
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)

# a quick test
if __name__ == "__main__":
    vgg, extra_layers, (l, c) = multibox(vgg(base['300'], 3),
                                         add_extras(extras['300'], 1024),
                                         [4, 6, 6, 6, 4, 4], 21)
    print(nn.Sequential(*l))
    print('---------------------------')
    print(nn.Sequential(*c))

Running this in a Jupyter notebook prints:

'''
loc layers:
'''
Sequential(
  (0): Conv2d(512, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Conv2d(1024, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (2): Conv2d(512, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(256, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
---------------------------
'''
conf layers:
'''
Sequential(
  (0): Conv2d(512, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Conv2d(1024, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (2): Conv2d(512, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(256, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)

The channel counts check out: on the 4-box maps, 16 = 4 boxes x 4 coordinates and 84 = 4 boxes x 21 classes; on the 6-box maps, 24 = 6 x 4 and 126 = 6 x 21.

Anchor Generation (the PriorBox Layer)

This was covered in the earlier post on SSD theory, but a recap is worthwhile. Starting from conv4_3 of the modified VGG16, SSD predicts from six feature maps of sizes (38,38), (19,19), (10,10), (5,5), (3,3) and (1,1), but the number of priors (anchors) placed on each feature map differs. A prior is specified by two things: a scale and an aspect ratio. The scales follow the linear rule

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where $m$ is the number of feature maps that take part in this rule, here 5, because the anchors of the first layer, conv4_3, are set separately, and $s_k$ is the size of the prior relative to the input image (note: relative to the image, not the feature map). Finally, $s_{min}$ and $s_{max}$ are the minimum and maximum ratios, taken as 0.2 and 0.9 in the paper. For the first feature map, the scale ratio is set to $s_{min}/2 = 0.1$, so its size on the 300-pixel input is $300 \times 0.1 = 30$. Plugging the remaining feature maps into the formula (the implementation rounds the increment down to $\lfloor (90-20)/4 \rfloor = 17$ percentage points) and mapping back to the 300-pixel input gives 60, 111, 162, 213 and 264 for the other five maps. Putting it together, the six feature maps use scales of 30, 60, 111, 162, 213 and 264. With the scale fixed, the aspect ratios are generally chosen from $a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$, and the width and height of a prior follow from its area and aspect ratio:

$$w_k^a = s_k \sqrt{a_r}, \quad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

A few points worth noting (a small verification script appears right after this list):

  • The $s_k$ values above are relative to the original image size.
  • By default, in addition to the priors with the five aspect ratios above, each feature map also gets one prior with scale $s'_k = \sqrt{s_k s_{k+1}}$ and aspect ratio 1, so every feature map carries two square priors of aspect ratio 1 but different sizes. For the last feature map, a virtual $s_{m+1}$ (315/300 in the implementation) is needed to compute its $s'_k$.
  • In the implementation, the conv4_3, conv10_2 and conv11_2 layers use only 4 priors each: the anchors with aspect ratios $3$ and $\frac{1}{3}$ are dropped.
  • The prior centers are placed at the center of each cell, i.e. $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$ with $i, j \in [0, |f_k|)$, where $|f_k|$ is the feature map size.

Looking at these values, the earlier the feature map, the smaller its anchors, which is to say the better it handles small objects. The total number of priors is num_priors = 38x38x4 + 19x19x6 + 10x10x6 + 5x5x6 + 3x3x4 + 1x1x4 = 8732.
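As a sanity check, here is a small standalone sketch (my own, not from the repo) that reproduces the scale rule and the 8732 total from the numbers above:

# Feature map sizes and priors per location for SSD300 (numbers from above).
feature_maps = [38, 19, 10, 5, 3, 1]
boxes_per_loc = [4, 6, 6, 6, 4, 4]

num_priors = sum(f * f * n for f, n in zip(feature_maps, boxes_per_loc))
print(num_priors)  # 8732

# Scale rule: conv4_3 is fixed at 0.1; the other five maps start at
# s_min = 0.2 and step in increments of floor((90 - 20) / 4)% = 17%.
scales = [0.1] + [0.2 + 0.17 * k for k in range(5)]
print([round(300 * s) for s in scales])  # [30, 60, 111, 162, 213, 264]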

The prior-generation code is as follows (from layers/functions/prior_box.py):

class PriorBox(object):
    """Compute priorbox coordinates in center-offset form for each source
    feature map.
    """
    def __init__(self, cfg):
        super(PriorBox, self).__init__()
        self.image_size = cfg['min_dim']
        # number of priors for feature map location (either 4 or 6)
        self.num_priors = len(cfg['aspect_ratios'])
        self.variance = cfg['variance'] or [0.1]
        self.feature_maps = cfg['feature_maps']
        self.min_sizes = cfg['min_sizes']
        self.max_sizes = cfg['max_sizes']
        self.steps = cfg['steps']
        self.aspect_ratios = cfg['aspect_ratios']
        self.clip = cfg['clip']
        self.version = cfg['name']
        for v in self.variance:
            if v <= 0:
                raise ValueError('Variances must be greater than 0')

    def forward(self):
        mean = []
        # iterate over the multi-scale feature maps: [38, 19, 10, 5, 3, 1]
        for k, f in enumerate(self.feature_maps):
            # iterate over every cell
            for i, j in product(range(f), repeat=2):
                # size of the k-th feature map
                f_k = self.image_size / self.steps[k]
                # center coordinates of each box
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k

                # aspect_ratio 1 produces two boxes
                # r == 1, size = s_k, square
                s_k = self.min_sizes[k] / self.image_size
                mean += [cx, cy, s_k, s_k]

                # r == 1, size = sqrt(s_k * s_(k+1)), square
                # rel size: sqrt(s_k * s_(k+1))
                s_k_prime = sqrt(s_k * (self.max_sizes[k] / self.image_size))
                mean += [cx, cy, s_k_prime, s_k_prime]

                # ratios != 1 produce rectangular boxes
                for ar in self.aspect_ratios[k]:
                    mean += [cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)]
                    mean += [cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)]
        # convert to a torch Tensor
        output = torch.Tensor(mean).view(-1, 4)
        # clip to keep the outputs inside [0, 1]
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
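For reference, here is a quick way to exercise the class (my own snippet; the config values below mirror the voc dict in data/config.py, so double-check them against your copy of the repo):

voc_cfg = {
    'min_dim': 300,
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'steps': [8, 16, 32, 64, 100, 300],
    'min_sizes': [30, 60, 111, 162, 213, 264],
    'max_sizes': [60, 111, 162, 213, 264, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'VOC',
}

priors = PriorBox(voc_cfg).forward()
print(priors.shape)  # torch.Size([8732, 4]), each row (cx, cy, w, h) in [0, 1]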

The Complete Network

Combining the modified VGG16, the Extra Layers, and the PriorBox anchor strategy introduced above, we can now write the overall SSD structure (code in ssd.py):

class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers.  Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, size, base, extras, head, num_classes):
        super(SSD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        # pick the config
        self.cfg = (coco, voc)[num_classes == 21]
        # initialize the priors
        self.priorbox = PriorBox(self.cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        self.size = size

        # SSD network
        # backbone
        self.vgg = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        # L2 normalization applied after conv4_3
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)
        # regression and classification heads
        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        if phase == 'test':
            self.softmax = nn.Softmax(dim=-1)
            self.detect = Detect(num_classes, 0, 200, 0.01, 0.45)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.
        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].
        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.vgg[k](x)
        # L2 normalization
        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        # apply extra layers and cache source layer outputs
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                # store the outputs used for multi-scale prediction
                sources.append(x)

        # apply multibox head to source layers
        # multi-scale regression and classification
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)
        if self.phase == "test":
            output = self.detect(
                loc.view(loc.size(0), -1, 4),                   # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                             self.num_classes)),                # conf preds
                self.priors.type(type(x.data))                  # default boxes
            )
        else:
            output = (
                # loc output, size: (batch, 8732, 4)
                loc.view(loc.size(0), -1, 4),
                # conf output, size: (batch, 8732, 21)
                conf.view(conf.size(0), -1, self.num_classes),
                # all the priors, size: ([8732, 4])
                self.priors
            )
        return output

    # load model weights
    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext == '.pkl' or ext == '.pth':  # note: upstream wrote `or '.pth'`, which is always truthy
            print('Loading weights into state dict...')
            self.load_state_dict(torch.load(base_file,
                                 map_location=lambda storage, loc: storage))
            print('Finished!')
        else:
            print('Sorry only .pth and .pkl files supported.')

Then, to improve readability, everything is wrapped up once more:

base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}


def build_ssd(phase, size=300, num_classes=21):
    if phase != "test" and phase != "train":
        print("ERROR: Phase: " + phase + " not recognized")
        return
    if size != 300:
        print("ERROR: You specified size " + repr(size) + ". However, " +
              "currently only SSD300 (size=300) is supported!")
        return
    # call multibox to generate vgg, extras, head
    base_, extras_, head_ = multibox(vgg(base[str(size)], 3),
                                     add_extras(extras[str(size)], 1024),
                                     mbox[str(size)], num_classes)
    return SSD(phase, size, base_, extras_, head_, num_classes)
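A minimal smoke test of the assembled model (my own snippet, assuming the 0.4-era PyTorch the repo targets; in the train phase the forward pass returns the raw triple):

net = build_ssd('train', size=300, num_classes=21)
x = torch.randn(1, 3, 300, 300)  # dummy input batch
loc, conf, priors = net(x)
print(loc.shape)     # torch.Size([1, 8732, 4])
print(conf.shape)    # torch.Size([1, 8732, 21])
print(priors.shape)  # torch.Size([8732, 4])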

Dissecting the Loss

SSD's loss function has two parts, a localization loss and a classification (confidence) loss; the whole objective is

$$L(x, c, l, g) = \frac{1}{N}\Big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\Big)$$

where $N$ is the number of positive (matched) priors, $x$ is the prior-to-ground-truth matching indicator, $c$ is the predicted class confidence, $l$ is the predicted offset of a prior's bounding box, and $g$ is the ground truth location. The localization loss is a Smooth L1 loss, computed on encoded coordinates; the encoding procedure is covered later. For the classification loss, hard negative mining first samples negatives so that the positive:negative ratio is 1:3; the sampling ranks the confidences of the whole batch by confidence error in descending order and takes the top_k negatives. The mining works as follows.

Implementation Steps

  • Reshape the conf of the whole batch, i.e. batch_conf = conf_data.view(-1, self.num_classes) in the code, to make the subsequent sort convenient.
  • A larger confidence error means a smaller predicted confidence for the background class (negatives are all labeled background).
  • Apply log-softmax to all the confidences (the values are all negative); the smaller the predicted confidence, the smaller the log-softmax, so the absolute value |logsoftmax| is larger. Sort -logsoftmax in descending order and take the top_k as negatives. The log_sum_exp helper used here is:
def log_sum_exp(x):
    x_max = x.detach().max()
    return torch.log(torch.sum(torch.exp(x - x_max), 1, keepdim=True)) + x_max

The classification confidence error conf_logP is then computed as:

conf_logP = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))  

The point of computing it this way is the numerical stability of the log-softmax loss. My hand derivation reduces to the identity below:
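Written out (my LaTeX reconstruction of the original hand-drawn derivation), the max-shift trick behind log_sum_exp is

$$\log\sum_j e^{x_j} = \log\Big(e^{x_{max}} \sum_j e^{x_j - x_{max}}\Big) = x_{max} + \log\sum_j e^{x_j - x_{max}}$$

Shifting by $x_{max}$ keeps every exponent non-positive, so the sum can never overflow. Subtracting the true-class score $x_y$ then yields exactly the cross-entropy term computed above:

$$\text{conf\_logP} = \log\sum_j e^{x_j} - x_y = -\log\frac{e^{x_y}}{\sum_j e^{x_j}}$$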

The complete loss implementation, from layers/modules/multibox_loss.py:

class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)
            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)  # batch_size
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))   # number of priors
        num_classes = self.num_classes  # number of classes

        # match priors (default boxes) and ground truth boxes
        # loc_t and conf_t hold the matched gt locations and class labels
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data  # ground truth boxes
            labels = targets[idx][:, -1].data   # ground truth labels
            defaults = priors.data              # prior boxes
            # match ground truth
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)
        # mask of all matched positives, shape [b, M]
        pos = conf_t > 0
        num_pos = pos.sum(dim=1, keepdim=True)
        # Localization Loss (Smooth L1)
        # shape [b, M] --> shape [b, M, 4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # predicted positive boxes
        loc_t = loc_t[pos_idx].view(-1, 4)     # matched gt boxes
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)  # Smooth L1 loss

        '''
        Hard negative mining:
            1. Rank all the batch's confidences by confidence error in
               descending order (the lower the predicted background
               confidence, the larger the error);
            2. Negatives are all labeled background, so compute logP with
               log-softmax: the larger |logP|, the lower the background
               probability and the larger the error;
            3. Keep the top_k largest errors as negatives, so the
               negative:positive ratio stays close to 3:1.
        '''
        # Compute max conf across batch for hard negative mining
        # shape [b*M, num_classes]
        batch_conf = conf_data.view(-1, self.num_classes)
        # confidence error via log-softmax, shape [b*M, 1]
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        loss_c = loss_c.view(num, -1)  # shape [b, M]; view first so the mask shapes match
        loss_c[pos] = 0  # zero out the positives; everything left is a negative candidate
        # double sort trick: idx_rank is each element's position in the descending order
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        # sample the negatives
        # number of positives per batch element, shape [b, 1]
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio * num_pos, max=pos.size(1) - 1)
        # keep the top_k negatives, shape [b, M]
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        # shape [b, M] --> shape [b, M, num_classes]
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        # gather all the selected positives and negatives (predicted and true)
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos + neg).gt(0)]
        # cross entropy on conf
        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        # N = number of positives
        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c

Prior Matching Strategy

There is one piece of the code above we have not covered yet: the match function, SSD's prior matching routine. During training we must first decide which prior matches each ground truth box in the image; the bounding box corresponding to the matched prior is then responsible for predicting it. SSD's matching follows two rules. First, for every ground truth in the image, find the prior with the largest IoU and match the two; this guarantees that every ground truth is matched to some prior. Second, among the remaining unmatched priors, if a prior's IoU with some ground truth exceeds a threshold (usually 0.5), that prior is also matched to that ground truth; whatever priors remain unmatched are negatives (and if several ground truths overlap one prior above the threshold, the prior matches only the one with the largest IoU). The implementation, from layers/box_utils.py:

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    """Match each prior box with the ground truth box of the highest jaccard
    overlap, encode the bounding boxes, and fill in the matched indices,
    confidences and locations.
    Args:
        threshold: IoU threshold; below it a prior is labeled background
        truths: ground truth boxes, shape [N,4]
        priors: prior boxes, shape [M,4]
        variances: variances of the priors, list(float)
        labels: all class labels for the image, shape [num_obj]
        loc_t: tensor to be filled with encoded loc targets
        conf_t: tensor to be filled with encoded conf targets
        idx: current batch index
        The matched indices corresponding to 1)location and 2)confidence preds.
    """
    # jaccard index
    # compute the IoU
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [1,num_objects] the prior box with the largest overlap for each gt box
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [1,num_priors] the gt box with the largest overlap for each prior box
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)      # M
    best_truth_overlap.squeeze_(0)  # M
    best_prior_idx.squeeze_(1)      # N
    best_prior_overlap.squeeze_(1)  # N
    # make sure every gt box keeps its matched prior: the fixed value 2 > threshold
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
    # TODO refactor: index  best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    # use best_prior_idx to pin the max-IoU prior inside best_truth_idx
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]   # the matched gt box for every prior, Shape: [M,4]
    conf = labels[best_truth_idx] + 1  # the class label for every prior, Shape: [M]
    # priors with IoU < threshold are labeled background, i.e. 0
    conf[best_truth_overlap < threshold] = 0  # label as background
    # encode the boxes
    loc = encode(matches, priors, variances)
    # store the matched loc and conf into loc_t and conf_t
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior

Box Coordinate Conversion

We saw a point_form function appear above. What is it for? A target box can be represented in two ways: center-size form (cx, cy, w, h), which the priors use, and corner form (xmin, ymin, xmax, ymax), which the ground truth boxes and the IoU computation use. point_form and center_size convert between the two.

  • The code for this part is in layers/box_utils.py:
def point_form(boxes):
    """ Convert prior_boxes to (xmin, ymin, xmax, ymax)
    i.e. convert a prior box from (cx, cy, w, h) to (xmin, ymin, xmax, ymax)
    """
    return torch.cat((boxes[:, :2] - boxes[:, 2:]/2,      # xmin, ymin
                      boxes[:, :2] + boxes[:, 2:]/2), 1)  # xmax, ymax


def center_size(boxes):
    """ Convert prior_boxes to (cx, cy, w, h)
    i.e. convert a prior box from (xmin, ymin, xmax, ymax) to (cx, cy, w, h)
    """
    return torch.cat(((boxes[:, 2:] + boxes[:, :2])/2,   # cx, cy
                      boxes[:, 2:] - boxes[:, :2]), 1)   # w, h
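A quick round-trip check of the two conversions (my own snippet):

b = torch.tensor([[0.5, 0.5, 0.2, 0.4]])  # center-size form (cx, cy, w, h)
corners = point_form(b)                   # tensor([[0.4, 0.3, 0.6, 0.7]])
print(center_size(corners))               # back to [[0.5, 0.5, 0.2, 0.4]]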

IoU Computation

This part is simple. For two boxes, first take the maximum of the two top-left corners and the minimum of the two bottom-right corners, compute the intersection area from those, and finally divide the intersection by the corresponding union. The code is still in layers/box_utils.py:

def intersect(box_a, box_b):
    """ We resize both tensors to [A,B,2] without new malloc:
    [A,2] -> [A,1,2] -> [A,B,2]
    [B,2] -> [1,B,2] -> [A,B,2]
    Then we compute the area of intersect between box_a and box_b.
    Args:
      box_a: (tensor) bounding boxes, Shape: [A,4].
      box_b: (tensor) bounding boxes, Shape: [B,4].
    Return:
      (tensor) intersection area, Shape: [A,B].
    """
    A = box_a.size(0)
    B = box_b.size(0)
    # bottom-right corners: take the minimum
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    # top-left corners: take the maximum
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    # truncate negatives at 0; 0 means an empty intersection
    inter = torch.clamp((max_xy - min_xy), min=0)
    return inter[:, :, 0] * inter[:, :, 1]


def jaccard(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes.  The jaccard overlap
    is simply the intersection over union of two boxes.  Here we operate on
    ground truth boxes and default boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
        box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
    Return:
        jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
    """
    inter = intersect(box_a, box_b)  # A ∩ B
    # areas of box_a and box_b
    area_a = ((box_a[:, 2]-box_a[:, 0]) *
              (box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2]-box_b[:, 0]) *
              (box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    union = area_a + area_b - inter
    return inter / union  # [A,B]
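A small sanity check (my own; both boxes in corner form):

a = torch.tensor([[0., 0., 2., 2.]])  # area 4
b = torch.tensor([[1., 1., 3., 3.]])  # area 4, overlapping a in a 1x1 square
print(jaccard(a, b))                  # tensor([[0.1429]]) = 1 / (4 + 4 - 1)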

L2 Normalization

The conv4_3 feature map of VGG16 has size 38x38x512. Being early in the network, its activations have large variance, so an L2 normalization is applied to keep it comparable with the later detection layers:

$$\hat{x}_i = \frac{x_i}{\lVert x \rVert_2}, \quad \lVert x \rVert_2 = \Big(\sum_{i=1}^{d} \lvert x_i \rvert^2\Big)^{1/2}$$

Note that naively L2-normalizing a layer's input would change the layer's scale and slow down learning, so a learnable scaling factor $\gamma$ is introduced; for each channel the normalized result becomes $y_i = \gamma_i \hat{x}_i$. Setting $\gamma$ to 10 or 20 usually works well. The code is from layers/modules/l2norm.py.

class L2Norm(nn.Module):
    '''
    The conv4_3 feature map (38x38) sits early in the network and has a large
    norm, so an L2 Normalization keeps it comparable with the later detection
    layers; see ParseNet, discussed in an earlier post.
    '''
    def __init__(self, n_channels, scale):
        super(L2Norm, self).__init__()
        self.n_channels = n_channels
        self.gamma = scale or None
        self.eps = 1e-10
        # turn a plain (untrainable) Tensor into a trainable Parameter
        self.weight = nn.Parameter(torch.Tensor(self.n_channels))
        self.reset_parameters()

    # initialize the parameters
    def reset_parameters(self):
        nn.init.constant_(self.weight, self.gamma)

    def forward(self, x):
        # 2-norm of x over the channel dim (eps added to guard against division by zero)
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps  # shape [b,1,38,38]
        x = x / norm  # shape [b,512,38,38]

        # broadcast self.weight to shape [1,512,1,1], then apply the formula
        out = self.weight[None, ..., None, None] * x
        return out
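A quick shape-and-scale check (my own snippet): after normalization, the per-location channel norm equals the learned scale, which is initialized to gamma:

layer = L2Norm(512, 20)
x = torch.randn(1, 512, 38, 38) * 100       # large activations, like conv4_3
out = layer(x)                              # shape [1, 512, 38, 38]
print(out.pow(2).sum(dim=1).sqrt().mean())  # ≈ 20 at every spatial location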

Box Encoding and Decoding

We mentioned above that the localization loss is computed on encoded coordinates. What does that mean? Following the paper, there is a fixed transformation between a predicted box and the ground truth box. First define some variables:

  • prior box position: $d = (d^{cx}, d^{cy}, d^{w}, d^{h})$
  • ground truth box position: $g = (g^{cx}, g^{cy}, g^{w}, g^{h})$
  • variance: the coordinate variances of the priors. The encoding can then be written as:

$$\hat{g}^{cx} = \frac{g^{cx} - d^{cx}}{variance[0] \cdot d^{w}}, \quad \hat{g}^{cy} = \frac{g^{cy} - d^{cy}}{variance[0] \cdot d^{h}}$$
$$\hat{g}^{w} = \frac{\log\left(g^{w} / d^{w}\right)}{variance[1]}, \quad \hat{g}^{h} = \frac{\log\left(g^{h} / d^{h}\right)}{variance[1]}$$

And the decoding is the inverse:

$$g^{cx} = variance[0] \cdot d^{w} \hat{g}^{cx} + d^{cx}, \quad g^{cy} = variance[0] \cdot d^{h} \hat{g}^{cy} + d^{cy}$$
$$g^{w} = d^{w} \exp\left(variance[1] \cdot \hat{g}^{w}\right), \quad g^{h} = d^{h} \exp\left(variance[1] \cdot \hat{g}^{h}\right)$$

The corresponding code is in layers/box_utils.py:

def encode(matched, priors, variances):
    """Encode the variances from the priorbox layers into the ground truth boxes
    we have matched (based on jaccard overlap) with the prior boxes.
    Args:
        matched: (tensor) Coords of ground truth for each prior in point-form
            Shape: [num_priors, 4].
        priors: (tensor) Prior boxes in center-offset form
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        encoded boxes (tensor), Shape: [num_priors, 4]
    """

    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]


# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
    """Decode locations from predictions using priors to undo
    the encoding we did for offset regression at train time.
    Args:
        loc (tensor): location predictions for loc layers,
            Shape: [num_priors,4]
        priors (tensor): Prior boxes in center-offset form.
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        decoded bounding box predictions
    """

    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes
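Encode and decode should be inverses of each other; a quick check (my own snippet):

priors = torch.tensor([[0.5, 0.5, 0.2, 0.2]])   # center-size form (cx, cy, w, h)
matched = torch.tensor([[0.4, 0.4, 0.6, 0.6]])  # matched gt box in corner form
variances = [0.1, 0.2]

loc = encode(matched, priors, variances)        # tensor([[0., 0., 0., 0.]]) here
print(decode(loc, priors, variances))           # tensor([[0.4, 0.4, 0.6, 0.6]])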

Post-processing: NMS

I covered the principle behind NMS in last week's post, so I won't repeat it here; the IoU threshold used is 0.5. If you're unfamiliar with how it works, see that post, which also walks through the source: https://mp.weixin.qq.com/s/orYMdwZ1VwwIScPmIiq5iA . The code for this part is also in layers/box_utils.py, so I won't paste it again.
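For completeness, here is a minimal greedy NMS sketch (my own, not the repo's implementation, which lives in layers/box_utils.py; it reuses the jaccard function from above):

def simple_nms(boxes, scores, iou_thresh=0.5, top_k=200):
    """Greedy NMS: keep the highest-scoring box, drop every box that
    overlaps it by more than iou_thresh, and repeat.
    boxes: [N, 4] in corner form; scores: [N]."""
    keep = []
    idx = scores.argsort(descending=True)[:top_k]
    while idx.numel() > 0:
        best = idx[0]
        keep.append(best.item())
        if idx.numel() == 1:
            break
        rest = idx[1:]
        ious = jaccard(boxes[best].unsqueeze(0), boxes[rest]).squeeze(0)
        idx = rest[ious <= iou_thresh]  # keep only the weakly-overlapping boxes
    return keep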

The Detection Function

At test time, the model feeds loc and conf into the detect function for NMS and then reports the results. The code is in layers/functions/detection.py:

class Detect(Function):
    """At test time, Detect is the final layer of SSD.  Decode location preds,
    apply non-maximum suppression to location predictions based on conf
    scores and threshold to a top_k number of output predictions for both
    confidence score and locations.
    """
    def __init__(self, num_classes, bkg_label, top_k, conf_thresh, nms_thresh):
        self.num_classes = num_classes
        self.background_label = bkg_label
        self.top_k = top_k
        # Parameters used in nms.
        self.nms_thresh = nms_thresh
        if nms_thresh <= 0:
            raise ValueError('nms_threshold must be non negative.')
        self.conf_thresh = conf_thresh
        self.variance = cfg['variance']

    def forward(self, loc_data, conf_data, prior_data):
        """
        Args:
            loc_data: predicted loc, shape [b,M,4], e.g. [b, 8732, 4]
            conf_data: predicted confidences, shape [b,M,num_classes], e.g. [b, 8732, 21]
            prior_data: priors, shape [M,4], e.g. [8732, 4]
        """
        num = loc_data.size(0)  # batch size
        num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)  # initialize the output
        conf_preds = conf_data.view(num, num_priors,
                                    self.num_classes).transpose(2, 1)

        # decode loc into ordinary bboxes
        for i in range(num):
            # decode loc
            decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            # copy the conf of this batch element for nms
            conf_scores = conf_preds[i].clone()
            # iterate over every class
            for cl in range(1, self.num_classes):
                # drop confidences with conf < conf_thresh
                c_mask = conf_scores[cl].gt(self.conf_thresh)
                scores = conf_scores[cl][c_mask]
                # if nothing survives, move on to the next class
                if scores.size(0) == 0:
                    continue
                # drop the boxes whose conf < conf_thresh
                l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
                boxes = decoded_boxes[l_mask].view(-1, 4)
                # idx of highest scoring and non-overlapping boxes per class
                # nms
                ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                # concatenate the nms outputs
                output[i, cl, :count] = torch.cat((scores[ids[:count]].unsqueeze(1),
                                                   boxes[ids[:count]]), 1)
        flt = output.contiguous().view(num, -1, 5)
        _, idx = flt[:, :, 0].sort(1, descending=True)
        _, rank = idx.sort(1)
        flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
        return output
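In the test phase, build_ssd routes the forward pass through Detect; for example (my own snippet, again assuming the 0.4-era PyTorch the repo targets, where a Function instance is callable):

net = build_ssd('test', size=300, num_classes=21)
x = torch.randn(1, 3, 300, 300)
detections = net(x)
# [batch, num_classes, top_k, 5], where the 5 = (score, xmin, ymin, xmax, ymax)
print(detections.shape)  # torch.Size([1, 21, 200, 5])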

Postscript

That wraps up the core SSD code. I think the flow of the algorithm is fairly clear by now, though SSD's strong results also owe a lot to its varied, effective data augmentation, which we may dissect another time. The outline of this article references the Zhihu post https://zhuanlan.zhihu.com/p/79854543. Reading the code, writing this up, and working through the details took about a week; if you have read this far, consider giving it a like.


Welcome to follow my WeChat official account GiantPandaCV, where I look forward to discussing machine learning, deep learning, image algorithms, optimization techniques, competitions, and everyday life with you.