Deep Learning Algorithm Optimization Series, Part 8 | Hands-on Model Pruning Code for VGG, ResNet, and DenseNet

  • February 12, 2020
  • Notes

Preface

The underlying theory was covered in the previous post, Deep Learning Algorithm Optimization Series, Part 7 | An ICCV 2017 model pruning paper that became the theoretical basis of many open-source pruning projects in 2019. This article explains model pruning from the source-code side; the code comes from https://github.com/Eric-mingjie/network-slimming . I walk through the concrete pruning procedure of each model together with the code, in the hope that it gives you some ideas for pruning your own models.

Sparse training

The idea of the paper is to introduce a scaling factor for every channel and multiply it with that channel's output. The network weights and these scaling factors are trained jointly, channels with small scaling factors are then removed, and the pruned network is fine-tuned. Specifically, the objective function is defined as:

$$L = \sum_{(x,y)} l\big(f(x, W),\, y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$

Here (x, y) are the training inputs and labels, W denotes the trainable network weights, and the first term is the ordinary CNN training loss. g(·) is a sparsity penalty on the scaling factors, and λ balances the two terms. In the paper's experiments g(s) = |s| is chosen, i.e. L1 regularization, which is widely used to induce sparsity. Subgradient descent is used to optimize the non-smooth (non-differentiable) L1 penalty; an alternative suggestion is to replace the L1 penalty with a smooth-L1 penalty, so that subgradients at non-smooth points are avoided as much as possible.
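
As an aside on that smooth-L1 suggestion, here is a minimal sketch (my own, not from the repository) of what such a penalty on the BN scaling factors could look like; the function name smooth_l1_penalty and the beta threshold are hypothetical:

import torch
import torch.nn as nn

def smooth_l1_penalty(model, lmbda=1e-4, beta=1e-2):
    """Hypothetical smooth-L1 (Huber-like) penalty on BN scaling factors.
    For |gamma| >= beta it behaves like lmbda * |gamma|; below beta it is
    quadratic, so the gradient is defined everywhere and no subgradient is needed."""
    penalty = 0.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            g = m.weight
            quad = 0.5 * g.pow(2) / beta   # quadratic region, |gamma| < beta
            lin = g.abs() - 0.5 * beta     # linear region, |gamma| >= beta
            penalty = penalty + torch.where(g.abs() < beta, quad, lin).sum()
    return lmbda * penalty

Such a penalty would be added to the cross-entropy loss before loss.backward(), instead of adding a sign term to the BN gradients as the repository's updateBN() below does.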

The implementation in main.py supports sparse training. The following line adds the sparsity penalty coefficient; note that it acts on the scaling factors (weights) of the BN layers:

parser.add_argument('--s', type=float, default=0.0001,  help='scale sparse rate (default: 0.0001)')  

The BN gradient update therefore has to include the corresponding penalty term:

def updateBN():
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(args.s * torch.sign(m.weight.data))  # L1
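
The add_ call is simply the (sub)gradient of the sparsity term: for the penalty λ|γ| on a BN scaling factor γ,

$$\frac{\partial}{\partial \gamma}\big(\lambda\,|\gamma|\big) = \lambda \cdot \mathrm{sign}(\gamma), \qquad \gamma \neq 0,$$

so adding args.s * torch.sign(m.weight.data) to each BN weight's gradient performs gradient descent on the penalized objective (at γ = 0, torch.sign returns 0, which is a valid subgradient).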

Finally, the code below trains, tests, and saves the baseline models (VGG16, ResNet-164, DenseNet-40). It is standard boilerplate, so I will not explain it further in this section:

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        pred = output.data.max(1, keepdim=True)[1]
        loss.backward()
        if args.sr:
            updateBN()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.1f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.data[0]))

def test():
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = model(data)
        test_loss += F.cross_entropy(output, target, size_average=False).data[0]  # sum up batch loss
        pred = output.data.max(1, keepdim=True)[1]  # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.1f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return correct / float(len(test_loader.dataset))

def save_checkpoint(state, is_best, filepath):
    torch.save(state, os.path.join(filepath, 'checkpoint.pth.tar'))
    if is_best:
        shutil.copyfile(os.path.join(filepath, 'checkpoint.pth.tar'),
                        os.path.join(filepath, 'model_best.pth.tar'))

best_prec1 = 0.
for epoch in range(args.start_epoch, args.epochs):
    # decay the learning rate by 10x at 50% and 75% of the total epochs
    if epoch in [args.epochs*0.5, args.epochs*0.75]:
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.1
    train(epoch)
    prec1 = test()
    is_best = prec1 > best_prec1
    best_prec1 = max(prec1, best_prec1)
    save_checkpoint({
        'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'best_prec1': best_prec1,
        'optimizer': optimizer.state_dict(),
    }, is_best, filepath=args.save)

print("Best accuracy: " + str(best_prec1))
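
For reference, sparse training is launched from the command line roughly as follows; the flags are quoted from the repository's README as best I recall them (the -sr switch enables the updateBN() penalty and --s sets its strength), so check them against the repo before use:

python main.py -sr --s 0.0001 --dataset cifar10 --arch vgg --depth 16 --epochs 160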

Pruning VGG16

The code is vggprune.py in the project root. The pruning procedure is as follows.

Model loading

Load the model to be pruned, i.e. the baseline model obtained from sparse training. args.depth specifies the depth of the VGG model, typically 16 or 19:

model = vgg(dataset=args.dataset, depth=args.depth)
if args.cuda:
    model.cuda()

if args.model:
    if os.path.isfile(args.model):
        print("=> loading checkpoint '{}'".format(args.model))
        checkpoint = torch.load(args.model)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        model.load_state_dict(checkpoint['state_dict'])
        print("=> loaded checkpoint '{}' (epoch {}) Prec1: {:f}"
              .format(args.model, checkpoint['epoch'], best_prec1))
    else:
        print("=> no checkpoint found at '{}'".format(args.model))

print(model)

Pre-pruning

First, determine a global pruning threshold, and use it to derive cfg_mask, the per-layer channel mask of the pruned network. cfg_mask fully determines the structure of the pruned model. Note that this step only decides which channel indices in each layer will be pruned and records them in cfg_mask; no pruning is actually performed yet. I have added some comments to the code, which should make it easy to follow.

# Count the total number of BN channels that are candidates for pruning
total = 0
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        total += m.weight.data.shape[0]

# Determine the global pruning threshold
bn = torch.zeros(total)
index = 0
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        size = m.weight.data.shape[0]
        bn[index:(index+size)] = m.weight.data.abs().clone()
        index += size
# Sort the absolute values of the BN scaling factors
y, i = torch.sort(bn)
thre_index = int(total * args.percent)
# The global pruning threshold
thre = y[thre_index]

#******************************** Pre-pruning *********************************#
pruned = 0
cfg = []
cfg_mask = []
for k, m in enumerate(model.modules()):
    if isinstance(m, nn.BatchNorm2d):
        weight_copy = m.weight.data.abs().clone()
        # Mask of the channels to keep
        mask = weight_copy.gt(thre).float().cuda()
        # Running count of pruned channels
        pruned = pruned + mask.shape[0] - torch.sum(mask)
        m.weight.data.mul_(mask)
        m.bias.data.mul_(mask)
        cfg.append(int(torch.sum(mask)))
        cfg_mask.append(mask.clone())
        print('layer index: {:d} \t total channel: {:d} \t remaining channel: {:d}'.
              format(k, mask.shape[0], int(torch.sum(mask))))
    elif isinstance(m, nn.MaxPool2d):
        cfg.append('M')

pruned_ratio = pruned/total

print('Pre-processing Successful!')
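
One caveat of a single global threshold: if --percent is set too high, every channel of some layer can fall below thre, and that layer would be pruned away entirely, leaving a broken network. A small safeguard (my own addition, not in vggprune.py) can catch this before the new model is built:

# Hypothetical safeguard: make sure no BN layer lost all of its channels.
for layer_idx, mask in enumerate(cfg_mask):
    if int(torch.sum(mask)) == 0:
        raise ValueError('Layer {} would keep 0 channels at percent={}; '
                         'lower --percent or prune per-layer.'.format(layer_idx, args.percent))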

Testing the pre-pruned model

Nothing special here; the comments in the code should be enough.

# simple test model after Pre-processing prune (simply sets pruned BN scales to zero)
#******************************** Test the pre-pruned model *********************************#
def test(model):
    kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
    # Load the test data
    if args.dataset == 'cifar10':
        test_loader = torch.utils.data.DataLoader(
            datasets.CIFAR10('./data.cifar10', train=False, transform=transforms.Compose([
                transforms.ToTensor(),
                # Per-channel (R, G, B) mean and std used for normalization
                transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])),
            batch_size=args.test_batch_size, shuffle=True, **kwargs)
    elif args.dataset == 'cifar100':
        test_loader = torch.utils.data.DataLoader(
            datasets.CIFAR100('./data.cifar100', train=False, transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])),
            batch_size=args.test_batch_size, shuffle=True, **kwargs)
    else:
        raise ValueError("No valid dataset is given.")
    model.eval()
    correct = 0
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = model(data)
        pred = output.data.max(1, keepdim=True)[1]  # get the index of the max log-probability
        # Count correct predictions
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()

    print('\nTest set: Accuracy: {}/{} ({:.1f}%)\n'.format(
        correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset)))
    return correct / float(len(test_loader.dataset))

acc = test(model)

The actual pruning

Pre-pruning gives us, for every feature map, the list of channel indices to remove; now we can perform the pruning according to that list. The full pruning code is:

# newmodel (a VGG built from the pruned cfg) and layer_id_in_cfg = 0 are defined
# earlier in vggprune.py; below, the weights of the old model are copied into it.
# Masks of the kept channel indices for the current layer's input and output
start_mask = torch.ones(3)
end_mask = cfg_mask[layer_id_in_cfg]
for [m0, m1] in zip(model.modules(), newmodel.modules()):
    # Both BN layers and Conv layers need to be pruned
    if isinstance(m0, nn.BatchNorm2d):
        # np.squeeze removes singleton dimensions from an array's shape
        # np.argwhere(a) returns the indices of the non-zero elements of a
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        # If only one channel remains, add a dimension back so that it matches
        # the dimensionality of the BN layer's weight
        if idx1.size == 1:
            idx1 = np.resize(idx1, (1,))
        m1.weight.data = m0.weight.data[idx1.tolist()].clone()
        m1.bias.data = m0.bias.data[idx1.tolist()].clone()
        m1.running_mean = m0.running_mean[idx1.tolist()].clone()
        m1.running_var = m0.running_var[idx1.tolist()].clone()
        layer_id_in_cfg += 1
        # start_mask lags one layer behind end_mask; it is used when pruning Conv2d
        start_mask = end_mask.clone()
        if layer_id_in_cfg < len(cfg_mask):  # do not change in Final FC
            end_mask = cfg_mask[layer_id_in_cfg]
    elif isinstance(m0, nn.Conv2d):
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        print('In shape: {:d}, Out shape {:d}.'.format(idx0.size, idx1.size))
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))
        if idx1.size == 1:
            idx1 = np.resize(idx1, (1,))
        # The conv weight tensor has shape [n, c, w, h]; when two conv layers are chained,
        # the next layer's input channels c equal the current layer's output channels n
        w1 = m0.weight.data[:, idx0.tolist(), :, :].clone()
        w1 = w1[idx1.tolist(), :, :, :].clone()
        m1.weight.data = w1.clone()
    elif isinstance(m0, nn.Linear):
        # The FC layer only needs its input features pruned to match the last conv output
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))
        m1.weight.data = m0.weight.data[:, idx0].clone()
        m1.bias.data = m0.bias.data.clone()

torch.save({'cfg': cfg, 'state_dict': newmodel.state_dict()},
           os.path.join(args.save, 'pruned.pth.tar'))

print(newmodel)
model = newmodel
test(model)

This completes the pruning of VGG16. After pruning, the new model still needs to be retrained (fine-tuned), again with main.py, just with different arguments:

python main.py --refine [PATH TO THE PRUNED MODEL] --dataset cifar10 --arch vgg --depth 16 --epochs 160  
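
Inside main.py, the --refine path loads the pruned checkpoint and rebuilds the smaller network from the saved cfg before fine-tuning it. Roughly, as I understand the repository's main.py (treat the exact lines as a sketch rather than a verbatim quote):

if args.refine:
    checkpoint = torch.load(args.refine)
    # Build a VGG whose per-layer channel counts follow the pruned cfg,
    # then load the pruned weights into it and retrain as usual.
    model = vgg(dataset=args.dataset, depth=args.depth, cfg=checkpoint['cfg'])
    model.load_state_dict(checkpoint['state_dict'])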

This yields the final model. The test results for VGG16 on CIFAR-10/100 after pruning and retraining are:

The results are quite good: after pruning and retraining, the accuracy is even higher than before.

Pruning ResNet

As mentioned in Deep Learning Algorithm Optimization Series, Part 7 | An ICCV 2017 model pruning paper that became the theoretical basis of many open-source pruning projects in 2019, in ResNet and DenseNet the output of one layer is fed to several subsequent layers, and the BN layers sit before the convolutions (pre-activation). In this situation, sparsity is realized at the input end of a layer: each layer selectively takes a subset of all channels for its next convolution. To actually save parameters and runtime at test time, a channel selection layer has to be inserted to pick out the important channels. To put the role of this layer more plainly: in ResNet, a BN layer that is followed by a channel selection layer is not itself pruned. Channel selection layers are placed after the first BN layer of every residual block and after the last BN layer of the whole network, because the feature maps entering these BN layers are shared with shortcut connections and consumed by several other layers, so their channels cannot simply be removed without breaking the network. These BN layers therefore keep all their channels, with the selection layer masking out the unimportant ones, and all other BN layers are pruned just as in VGG.

The channel selection layer

The code for the channel selection layer is in models/channel_selection.py:

import numpy as np
import torch
import torch.nn as nn


class channel_selection(nn.Module):
    """
    Select channels from the output of a BN layer. It should be placed directly
    after a BN layer; the number of output channels equals the number of ones in self.indexes.
    """
    def __init__(self, num_channels):
        """
        Initialize `indexes` as an all-ones vector whose length equals the number of
        channels. During pruning, the entries of `indexes` corresponding to the pruned
        channels are set to 0.
        """
        super(channel_selection, self).__init__()
        self.indexes = nn.Parameter(torch.ones(num_channels))

    def forward(self, input_tensor):
        """
        Parameters:
        input_tensor: tensor of shape (N, C, H, W), i.e. the output of a BN layer
        """
        selected_index = np.squeeze(np.argwhere(self.indexes.data.cpu().numpy()))
        if selected_index.size == 1:
            selected_index = np.resize(selected_index, (1,))
        output = input_tensor[:, selected_index, :, :]
        return output
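
A quick usage sketch of this layer on a toy tensor (my own example, assuming a reasonably recent PyTorch): zeroing entries of indexes drops the corresponding channels from the output.

import torch

cs = channel_selection(num_channels=8)
cs.indexes.data[[1, 3, 5]] = 0.      # mark channels 1, 3 and 5 as pruned
x = torch.randn(2, 8, 32, 32)        # (N, C, H, W), e.g. the output of a BN layer
y = cs(x)
print(y.size())                      # torch.Size([2, 5, 32, 32])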

Inserting the channel selection layers into ResNet

The channel selection layers are inserted into ResNet as described above; the code is in models/preresnet.py. The commented lines below mark where a channel selection layer is added after certain BN layers of the original ResNet; everything else is identical to the original model:

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, cfg, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(inplanes)
        # Newly added channel selection layer, placed right after the BN layer
        self.select = channel_selection(inplanes)
        self.conv1 = nn.Conv2d(cfg[0], cfg[1], kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(cfg[1])
        self.conv2 = nn.Conv2d(cfg[1], cfg[2], kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(cfg[2])
        self.conv3 = nn.Conv2d(cfg[2], planes * 4, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.bn1(x)
        out = self.select(out)
        out = self.relu(out)
        out = self.conv1(out)

        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv2(out)

        out = self.bn3(out)
        out = self.relu(out)
        out = self.conv3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual

        return out


class resnet(nn.Module):
    def __init__(self, depth=164, dataset='cifar10', cfg=None):
        super(resnet, self).__init__()
        assert (depth - 2) % 9 == 0, 'depth should be 9n+2'

        n = (depth - 2) // 9
        block = Bottleneck

        if cfg is None:
            # Construct config variable.
            cfg = [[16, 16, 16], [64, 16, 16]*(n-1), [64, 32, 32], [128, 32, 32]*(n-1),
                   [128, 64, 64], [256, 64, 64]*(n-1), [256]]
            cfg = [item for sub_list in cfg for item in sub_list]

        self.inplanes = 16

        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1,
                               bias=False)
        self.layer1 = self._make_layer(block, 16, n, cfg=cfg[0:3*n])
        self.layer2 = self._make_layer(block, 32, n, cfg=cfg[3*n:6*n], stride=2)
        self.layer3 = self._make_layer(block, 64, n, cfg=cfg[6*n:9*n], stride=2)
        self.bn = nn.BatchNorm2d(64 * block.expansion)
        # Newly added channel selection layer, placed right after the final BN layer
        self.select = channel_selection(64 * block.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.avgpool = nn.AvgPool2d(8)

        if dataset == 'cifar10':
            self.fc = nn.Linear(cfg[-1], 10)
        elif dataset == 'cifar100':
            self.fc = nn.Linear(cfg[-1], 100)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(0.5)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, cfg, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
            )

        layers = []
        layers.append(block(self.inplanes, planes, cfg[0:3], stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, cfg[3*i: 3*(i+1)]))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)

        x = self.layer1(x)  # 32x32
        x = self.layer2(x)  # 16x16
        x = self.layer3(x)  # 8x8
        x = self.bn(x)
        x = self.select(x)
        x = self.relu(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x
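
To sanity-check how cfg and the selection layers are wired together, a small sketch (my own; it assumes the resnet class above is in scope and a recent PyTorch) runs a dummy CIFAR-sized batch through the unpruned network:

import torch

model = resnet(depth=164, dataset='cifar10')   # (164 - 2) / 9 = 18 bottlenecks per stage
x = torch.randn(2, 3, 32, 32)                  # CIFAR-sized input
out = model(x)
print(out.size())                              # expected: torch.Size([2, 10])
print(sum(p.numel() for p in model.parameters()))  # rough parameter count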

Pruning the ResNet model

This is almost identical to the VGG case; the only core difference is that the actual pruning loop does a little more work. The code is in resprune.py in the project root. Below is the part that differs from VGG16, i.e. the actual pruning loop, with comments:

# old_modules / new_modules are list(model.modules()) / list(newmodel.modules()),
# and layer_id_in_cfg, start_mask, end_mask, conv_count are initialized earlier in resprune.py.
for layer_id in range(len(old_modules)):
    m0 = old_modules[layer_id]
    m1 = new_modules[layer_id]
    # Both BN layers and Conv layers need to be pruned
    if isinstance(m0, nn.BatchNorm2d):
        # np.squeeze removes singleton dimensions from an array's shape
        # np.argwhere(a) returns the indices of the non-zero elements of a
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        # If only one channel remains, add a dimension back so that it matches
        # the dimensionality of the BN layer's weight
        if idx1.size == 1:
            idx1 = np.resize(idx1, (1,))
        # If the next layer is a channel selection layer -- the only difference
        # between ResNet pruning and VGG pruning
        if isinstance(old_modules[layer_id + 1], channel_selection):
            # The BN layer itself is not pruned
            m1.weight.data = m0.weight.data.clone()
            m1.bias.data = m0.bias.data.clone()
            m1.running_mean = m0.running_mean.clone()
            m1.running_var = m0.running_var.clone()

            # We need to set the channel selection layer.
            m2 = new_modules[layer_id + 1]
            m2.indexes.data.zero_()
            m2.indexes.data[idx1.tolist()] = 1.0

            layer_id_in_cfg += 1
            start_mask = end_mask.clone()
            if layer_id_in_cfg < len(cfg_mask):
                end_mask = cfg_mask[layer_id_in_cfg]
        else:
            # Otherwise, prune the BN layer as usual
            m1.weight.data = m0.weight.data[idx1.tolist()].clone()
            m1.bias.data = m0.bias.data[idx1.tolist()].clone()
            m1.running_mean = m0.running_mean[idx1.tolist()].clone()
            m1.running_var = m0.running_var[idx1.tolist()].clone()
            layer_id_in_cfg += 1
            start_mask = end_mask.clone()
            if layer_id_in_cfg < len(cfg_mask):  # do not change in Final FC
                end_mask = cfg_mask[layer_id_in_cfg]
    elif isinstance(m0, nn.Conv2d):
        if conv_count == 0:
            # The very first convolution of the network is copied unchanged
            m1.weight.data = m0.weight.data.clone()
            conv_count += 1
            continue
        # Normal pruning for the convolutions inside the residual blocks
        if isinstance(old_modules[layer_id-1], channel_selection) or isinstance(old_modules[layer_id-1], nn.BatchNorm2d):
            # This covers the convolutions in the residual block.
            # The convolutions are either after the channel selection layer or after the batch normalization layer.
            conv_count += 1
            idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
            idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
            print('In shape: {:d}, Out shape {:d}.'.format(idx0.size, idx1.size))
            if idx0.size == 1:
                idx0 = np.resize(idx0, (1,))
            if idx1.size == 1:
                idx1 = np.resize(idx1, (1,))
            w1 = m0.weight.data[:, idx0.tolist(), :, :].clone()

            # If the current convolution is not the last convolution in the residual block, then we can change the
            # number of output channels. Currently we use `conv_count` to detect whether it is such convolution.
            if conv_count % 3 != 1:
                w1 = w1[idx1.tolist(), :, :, :].clone()
            m1.weight.data = w1.clone()
            continue

        # We need to consider the case where there are downsampling convolutions.
        # For these convolutions, we just copy the weights.
        m1.weight.data = m0.weight.data.clone()
    elif isinstance(m0, nn.Linear):
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))

        m1.weight.data = m0.weight.data[:, idx0].clone()
        m1.bias.data = m0.bias.data.clone()
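
After this loop finishes and newmodel's weights have been copied in, a cheap sanity check (my own addition, not part of resprune.py) is to push one dummy batch through the pruned network and make sure all channel counts line up before retraining:

# Hypothetical sanity check: one forward pass through the pruned model.
newmodel.eval()
dummy = torch.randn(1, 3, 32, 32)
if args.cuda:
    dummy = dummy.cuda()
out = newmodel(Variable(dummy, volatile=True))
print('pruned model output size:', out.size())  # expected (1, num_classes)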

Retrain

Finally, the pruned network still needs to be retrained. The test results on CIFAR-10 and CIFAR-100 are:

Pruning DenseNet

Having explained VGGNet and ResNet, for DenseNet we only need to look at what differs from the two cases above. After going through the code, it turns out to follow exactly the same pattern as ResNet (DenseNet also places BN before the convolutions and uses the same channel selection layers), so I will not repeat it. Here are just the test results:

Afterword

This article covered the pruning methods and details for three mainstream backbones: VGG16, ResNet-164, and DenseNet-40. On CIFAR-10/100, all three can have more than half of their parameters pruned away without losing accuracy (in most cases accuracy even improves), which demonstrates both the effectiveness of the algorithm and its engineering friendliness. In addition, the pruned models are relatively easy to port along a PyTorch -> ONNX -> mobile-framework pipeline.
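
As a pointer for that PyTorch -> ONNX -> mobile path, a minimal export sketch could look like the following (my own example, not from the repository; the 1x3x32x32 input size and the opset are assumptions for a CIFAR model, and it targets the pruned VGG — for ResNet/DenseNet the numpy indexing inside channel_selection may need to be rewritten before tracing):

import torch

newmodel.eval()
dummy_input = torch.randn(1, 3, 32, 32)   # CIFAR-sized input
torch.onnx.export(newmodel, dummy_input, 'pruned_model.onnx',
                  input_names=['input'], output_names=['logits'],
                  opset_version=11)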

Notes

  • Source code repository: https://github.com/Eric-mingjie/network-slimming