【Transferable NAS with RL】2018-CVPR-Learning Transferable Architectures for Scalable Image Recognition - Paper Reading

Transferable NAS with RL

2018-CVPR-Learning Transferable Architectures for Scalable Image Recognition

Introduction

In this paper, we study a method to learn the model architectures directly on the dataset of interest.

The paper presents a method to search for network architectures directly on the dataset of interest.

We also introduce a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models.

It also introduces a new regularization technique, ScheduledDropPath, which effectively improves the generalization of NASNet models.
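ScheduledDropPath drops each path within a cell with a probability that increases linearly over the course of training. A minimal sketch of the idea, assuming a PyTorch-style (N, C, H, W) tensor; the function name and signature are illustrative, not the paper's code:

```python
import torch

def scheduled_drop_path(x, max_drop_prob, step, total_steps, training=True):
    """Drop an entire path with a linearly increasing probability."""
    if not training or max_drop_prob == 0.0:
        return x
    drop_prob = max_drop_prob * step / total_steps   # linear schedule
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per example, broadcast over channels and space.
    mask = torch.bernoulli(
        torch.full((x.size(0), 1, 1, 1), keep_prob, device=x.device))
    return x * mask / keep_prob  # rescale to keep the expectation unchanged
```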

On CIFAR-10 itself, a NASNet found by our method achieves 2.4% error rate, which is state-of-the-art.

The NASNet found on CIFAR-10 reaches a state-of-the-art 2.4% error rate (97.6% accuracy).

Although the cell is not searched for directly on ImageNet, a NASNet constructed from the best cell achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet.

Although the cell was not searched directly on ImageNet, NASNet still reaches 82.7% top-1 accuracy (above all published work, on par with the best unpublished result) and 96.2% top-5.

Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS – a reduction of 28% in computational demand from the previous state-of-the-art model.

Its top-1 accuracy is 1.2% higher than the best human-designed model, with 9 billion fewer FLOPs, a 28% reduction in compute.

For instance, a small version of NASNet also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms.

A small version of NASNet also reaches 74% top-1 accuracy, 3.1% better than equally sized models, the state of the art among mobile-scale networks.

Finally, the image features learned from image classification are generically useful and can be transferred to other computer vision problems.

Finally, the features learned on image classification are generically useful and transfer to other vision tasks.

On the task of object detection, the learned features by NASNet used with the Faster-RCNN framework surpass state-of-the-art by 4.0% achieving 43.1% mAP on the COCO dataset.

On object detection, NASNet features with Faster-RCNN surpass the state of the art by 4.0%, reaching 43.1% mAP on COCO.

Motivation

As this approach is expensive when the dataset is large, we propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset.

Searching the full network directly is too costly when the target dataset is large, so the paper first searches for a network building block on a small dataset and then transfers it to the large one.

One inspiration for the NASNet search space is the realization that architecture engineering with CNNs often identifies repeated motifs consisting of combinations of convolutional filter banks, …

One inspiration for the NASNet search space: CNN architectures in practice tend to repeat the same motif many times.

Contribution

Transferable: from a small dataset to larger datasets, and from classification to other vision tasks.

Scalable: easily scaled from large models down to small ones.

The main contribution of this work is the design of a novel search space, such that the best architecture found on the CIFAR-10 dataset would scale to larger, higher resolution image datasets across a range of computational settings.

The main contribution is a novel search space (the NASNet search space): the best architecture found on CIFAR-10 scales to larger, higher-resolution datasets across a range of computational settings.

Our approach is inspired by the recently proposed Neural Architecture Search (NAS) framework [71], which uses a reinforcement learning search method to optimize architecture configurations.

Building on the NAS framework [71], the search for the best overall architecture is reduced to the search for the best building block (cell).

Searching for the best cell structure has two main benefits: it is much faster than searching for an entire network architecture and the cell itself is more likely to generalize to other problems.

Searching for the best cell has two benefits: it is much faster, and the cell generalizes better (it is more easily transferred to other problems).

Additionally, by simply varying the number of the convolutional cells and number of filters in the convolutional cells, we can create different versions of NASNets with different computational demands.

Moreover, simply varying the number of cells and the number of filters per cell (scaling) yields NASNet versions with different computational costs.

Thanks to this property of the cells, we can generate a family of models that achieve accuracies superior to all human-invented models at equivalent or smaller computational budgets [60, 29].

Thanks to this property of the cells, a whole family of models can be generated that beats all human-designed models at equal or smaller computational budgets.

Method

Search Method (reinforcement learning / random search)

Our approach is inspired by the recently proposed Neural Architecture Search (NAS) framework [71], which uses a reinforcement learning search method to optimize architecture configurations.

The framework is the same as NAS with RL [71]: reinforcement learning is used to search for the network architecture.

The controller weights are updated with policy gradient (see Figure 1).

[Figure 1: the controller RNN samples an architecture, the child network is trained, and its accuracy is used to update the controller.]

The controller RNN's weights are updated with policy gradient.
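As a rough sketch of this update (the paper actually trains the controller with Proximal Policy Optimization; the plain REINFORCE form below only illustrates the policy-gradient idea, and all names are assumptions):

```python
import torch

def controller_update(log_probs, reward, baseline, optimizer):
    """log_probs: log-probabilities of the sampled architecture decisions;
    reward: validation accuracy of the trained child network;
    baseline: e.g. a moving average of previous rewards."""
    loss = -(reward - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```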

Network Architecture

This cell can then be stacked in series to handle inputs of arbitrary spatial dimensions and filter depth.

The cell can be stacked in series repeatedly, so the network handles inputs of arbitrary spatial size and filter depth.

In our approach, the overall architectures of the convolutional nets are manually predetermined.

In this method, the overall network layout (how many times cells are stacked, i.e., the depth) is fixed manually in advance.

They are composed of convolutional cells repeated many times where each convolutional cell has the same architecture, but different weights.

The whole network is built by repeating the convolutional cell many times; the copies share one structure but have different weights.

To easily build scalable architectures for images of any size, we need two types of convolutional cells to serve two main functions when taking in a feature map as input:

(1) convolutional cells that return a feature map of the same dimension, and

(2) convolutional cells that return a feature map where the feature map height and width is reduced by a factor of two.

We name the first type and second type of convolutional cells Normal Cell and Reduction Cell respectively.

To handle images of different sizes, two kinds of cells are needed:

(1) Normal Cell: the output feature map has the same height and width as the input;

(2) Reduction Cell: the output height and width are halved.

The Reduction and Normal Cell could have the same architecture, but we empirically found it beneficial to learn two separate architectures.

The two cell types could share one architecture, but empirically, learning two separate architectures works better.

All of our operations that we consider for building our convolutional cells have an option of striding.

Every operation considered for the cells has an optional stride.

Figure 2 shows our placement of Normal and Reduction Cells for CIFAR-10 and ImageNet.

Figure 2 shows how Normal and Reduction Cells are arranged for CIFAR-10 and ImageNet.

[Figure 2: NASNet macro-architectures, i.e., the stacking of Normal and Reduction Cells, for CIFAR-10 and ImageNet.]

Note on ImageNet we have more Reduction Cells, since the incoming image size is 299×299 compared to 32×32 for CIFAR.

The ImageNet architecture has more Reduction Cells because its input is 299×299, versus 32×32 for CIFAR-10.

We use a common heuristic to double the number of filters in the output whenever the spatial activation size is reduced in order to maintain roughly constant hidden state dimension [32, 53].

A common heuristic is used: whenever the spatial size of the feature map is halved, the number of filters is doubled, keeping the hidden-state volume roughly constant.

Importantly, much like Inception and ResNet models [59, 20, 60, 58], we consider the number of motif repetitions N and the number of initial convolutional filters as free parameters that we tailor to the scale of an image classification problem.

Like Inception and ResNet, the number of cell repetitions N and the number of initial convolutional filters are treated as free hyperparameters, set afterwards to match the scale of the classification problem.
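A small sketch of how these choices fit together on the CIFAR-10 layout of Figure 2: N Normal Cells per stage, a Reduction Cell between stages, and the filter count doubled at every reduction. `make_normal` and `make_reduction` are hypothetical cell constructors, not from the paper:

```python
def build_cifar_stack(N, initial_filters, make_normal, make_reduction):
    """Cell sequence: N normal cells, reduce, N normal, reduce, N normal."""
    layers, filters = [], initial_filters
    for stage in range(3):                # three stages on CIFAR-10
        layers += [make_normal(filters) for _ in range(N)]   # keep H x W
        if stage < 2:
            filters *= 2                  # double the filters ...
            layers.append(make_reduction(filters))  # ... as H, W halve
    return layers
```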

Search Space

The structures of the cells can be searched within a search space defined as follows (see Appendix, Figure 7 for schematic).

The cell search space is defined as follows (see the schematic in Figure 7 of the Appendix).

[Figure 7 (Appendix): schematic of the NASNet search space.]

In our search space, each cell receives as input two initial hidden states hi and hi−1 which are the outputs of two cells in previous two lower layers or the input image.

In this search space, each cell takes two hidden states hi and hi−1 as input: the outputs of the two previous cells (two layers below), or the input image itself.

The controller RNN recursively predicts the rest of the structure of the convolutional cell, given these two initial hidden states (Figure 3).

Given these two initial hidden states, the controller RNN recursively predicts the rest of the cell structure.

[Figure 3: the controller RNN's five prediction steps for one block.]

The predictions of the controller for each cell are grouped into B blocks, where each block has 5 prediction steps made by 5 distinct softmax classifiers corresponding to discrete choices of the elements of a block:

The controller's predictions for each cell are grouped into B blocks (each block is one node inside the cell); each block is specified by 5 prediction steps, made by 5 distinct softmax classifiers:

  • Step 1. Select a hidden state from hi, hi−1 or from the set of hidden states created in previous blocks.

  • Step 2. Select a second hidden state from the same options as in Step 1.

  • Step 3. Select an operation to apply to the hidden state selected in Step 1.

  • Step 4. Select an operation to apply to the hidden state selected in Step 2.

  • Step 5. Select a method to combine the outputs of Step 3 and 4 to create a new hidden state.

In short: each block picks two inputs from the current hidden-state set (Steps 1-2), applies one operation to each (Steps 3-4, giving outputs 1 and 2), and combines the two outputs into a new hidden state, the block's output (Step 5).

The algorithm appends the newly-created hidden state to the set of existing hidden states as a potential input in subsequent blocks.

Each newly created hidden state is appended to the set of existing hidden states, becoming a candidate input for subsequent blocks.

The controller RNN repeats the above 5 prediction steps B times corresponding to the B blocks in a convolutional cell.

The controller repeats these 5 prediction steps B times, once per block in the cell.

In our experiments, selecting B = 5 provides good results, although we have not exhaustively searched this space due to computational limitations.

The experiments use B = 5; the space was not searched exhaustively due to compute limits.
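A concrete illustration of the 5 steps, with uniform random choices standing in for the controller's softmaxes (this is exactly the random-search variant discussed later; the operation list is an abbreviated subset of the paper's set, given below):

```python
import random

OPS = ["identity", "3x3 avg pool", "3x3 max pool",
       "3x3 sep conv", "5x5 sep conv", "7x7 sep conv"]

def sample_block(hidden_states):
    in1 = random.choice(hidden_states)        # Step 1
    in2 = random.choice(hidden_states)        # Step 2
    op1 = random.choice(OPS)                  # Step 3
    op2 = random.choice(OPS)                  # Step 4
    comb = random.choice(["add", "concat"])   # Step 5
    return (in1, op1, in2, op2, comb)

def sample_cell(B=5):
    states = ["h[i-1]", "h[i]"]               # the two initial inputs
    blocks = []
    for b in range(B):
        blocks.append(sample_block(states))
        states.append(f"block{b}")            # new state joins the set
    return blocks, states
```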

In steps 3 and 4, the controller RNN selects an operation to apply to the hidden states.

We collected the following set of operations based on their prevalence in the CNN literature:

The operations applied in Steps 3/4 are chosen from the following set, collected for its prevalence in the CNN literature:

[Operation set: identity; 1×3 then 3×1 convolution; 1×7 then 7×1 convolution; 3×3 dilated convolution; 3×3 average pooling; 3×3 max pooling; 5×5 max pooling; 7×7 max pooling; 1×1 convolution; 3×3 convolution; 3×3 depthwise-separable convolution; 5×5 depthwise-separable convolution; 7×7 depthwise-separable convolution.]

In step 5 the controller RNN selects a method to combine the two hidden states either

(1) element-wise addition between two hidden states or

(2) concatenation between two hidden states along the filter dimension.

In Step 5 the controller chooses between two ways to combine the two outputs:

(1) element-wise addition of the two outputs;

(2) concatenation of the two outputs along the filter (channel) dimension.

Finally, all of the unused hidden states generated in the convolutional cell are concatenated together in depth to provide the final cell output.

Finally, all hidden states in the set that were never used as an input by any block are concatenated depth-wise to form the cell's output.
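Continuing the sketch above, the cell output can be formed from whatever states no block consumed:

```python
def cell_output(blocks, states):
    # States used as an input by some block do not appear in the output.
    used = {b[0] for b in blocks} | {b[2] for b in blocks}
    unused = [s for s in states if s not in used]
    return "concat(" + ", ".join(unused) + ")"   # depth-wise concatenation

blocks, states = sample_cell()
print(cell_output(blocks, states))
```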

To allow the controller RNN to predict both Normal Cell and Reduction Cell, we simply make the controller have 2 × 5B predictions in total, where the first 5B predictions are for the Normal Cell and the second 5B predictions are for the Reduction Cell.

To predict both the Normal Cell and the Reduction Cell, the controller simply makes 2 × 5B predictions in total: the first 5B for the Normal Cell, the second 5B for the Reduction Cell.

Search Process

Finally, our work makes use of the reinforcement learning proposal in NAS [71]

The search space is searched with the reinforcement-learning method proposed in NAS [71].

however, it is also possible to use random search to search for architectures in the NASNet search space.

However, random search can also be used over the same space.

In random search, instead of sampling the decisions from the softmax classifiers in the controller RNN, we can sample the decisions from the uniform distribution.

In random search, each decision is sampled from a uniform distribution instead of from the controller RNN's softmax classifiers.
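A minimal contrast of the two sampling modes; `logits` stands for a hypothetical controller output for one prediction step:

```python
import numpy as np

def sample_decision(num_choices, logits=None, random_search=False):
    if random_search or logits is None:
        probs = np.full(num_choices, 1.0 / num_choices)   # uniform
    else:
        e = np.exp(logits - np.max(logits))               # stable softmax
        probs = e / e.sum()
    return np.random.choice(num_choices, p=probs)
```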

In our experiments, we find that random search is slightly worse than reinforcement learning on the CIFAR-10 dataset. Although there is value in using reinforcement learning, the gap is smaller than what is found in the original work of [71].

Experimentally, random search is slightly worse than RL, but the gap is smaller than reported in the original work [71].

This result suggests that

  1. the NASNet search space is well-constructed such that random search can perform reasonably well and

  2. random search is a difficult baseline to beat.

We will compare reinforcement learning against random search in Section 4.4.

This result suggests that:

(1) the NASNet search space is well constructed, so even random search performs reasonably well;

(2) random search is a hard baseline to beat.

RL and random search are compared in Section 4.4.

Sec 4. Experiments

Hardware

In our experiments, the pool of workers in the workqueue consisted of 500 GPUs.

The experiments used a pool of 500 GPUs.

The result of this search process over 4 days yields several candidate convolutional cells.

The search ran for 4 days and yielded several candidate cells.

We note that this search procedure is almost 7× faster than previous approaches [71] that took 28 days.*1

The search is roughly 7× faster than the previous work [71], which took 28 days.

*1. We note that previous architecture search [71] used 800 GPUs for 28 days resulting in 22,400 GPU-hours.

[71]: 800 GPUs × 28 days = 22,400 (quoted as GPU-hours, though the arithmetic works out in GPU-days).

The method in this paper uses 500 GPUs across 4 days resulting in 2,000 GPU-hours.

This paper: 500 GPUs × 4 days = 2,000.

The former effort used Nvidia K40 GPUs, whereas the current efforts used faster NVidia P100s. Discounting the fact that we use faster hardware, we estimate that the current procedure is roughly about 7× more efficient.

The earlier work used K40 GPUs while this one used faster P100s; discounting the hardware gap, the estimated speedup is roughly 7×.
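For reference, the footnote's arithmetic (note the units work out to GPU-days):

```python
prior = 800 * 28     # NAS [71]: 22,400 GPU-days
ours = 500 * 4       # this paper: 2,000 GPU-days
print(prior / ours)  # 11.2x raw; ~7x after discounting K40 -> P100
```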

Best Cell

Figure 4 shows a diagram of the top performing Normal Cell and Reduction Cell.

Figure 4 shows the best-performing Normal Cell and Reduction Cell.

[Figure 4: diagrams of the best Normal Cell and Reduction Cell.]

Sec 4.1 Results on CIFAR-10 Image Classification

For the task of image classification with CIFAR-10, we set N = 4 or 6 (Figure 2).

For CIFAR-10 classification, N is set to 4 or 6 (Figure 2).


The test accuracies of the best architectures are reported in Table 1 along with other state-of-the-art models.

Test accuracies of the best architectures, alongside other state-of-the-art models, are reported in Table 1:

[Table 1: CIFAR-10 error rates of NASNet variants and prior state-of-the-art models.]

As can be seen from the Table, a large NASNet-A model with cutout data augmentation [12] achieves a state-of-the-art error rate of 2.40% (averaged across 5 runs), which is slightly better than the previous best record of 2.56% by [12].

The best single run from our model achieves 2.19% error rate.

As the table shows, the large NASNet-A with cutout data augmentation reaches a state-of-the-art 2.40% error (averaged over 5 runs), slightly better than the previous best of 2.56% [12]; the best single run reaches 2.19%.

Sec 4.2. Results on ImageNet Image Classification

We emphasize that we merely transfer the architectures from CIFAR-10 but train all ImageNet models weights from scratch.

Results are summarized in Table 2 and 3 and Figure 5.

The same cell structures found on CIFAR-10 are used on ImageNet, but all weights are trained from scratch. Results are summarized in Tables 2 and 3 and Figure 5.

We show that this family of models achieve state-of-the-art performance with fewer floating point operations and parameters than comparable architectures

The resulting model family reaches state-of-the-art performance with fewer FLOPs and parameters than comparable architectures.

Second, we demonstrate that by adjusting the scale of the model we can achieve state-of-the-art performance at smaller computational budgets

By adjusting the model scale, state-of-the-art accuracy is also reached at smaller computational budgets.

Note we do not have residual connections between convolutional cells as the models learn skip connections on their own. We empirically found manually inserting residual connections between cells to not help performance

Note that no residual connections are placed between cells by hand; the models learn skip connections on their own. Empirically, manually inserting residual connections between cells did not help.

Our training setup on ImageNet is similar to [60], but please see Appendix A for details.

The ImageNet training setup is similar to [60]; see Appendix A for details.

Table 2 shows that the convolutional cells discovered with CIFAR-10 generalize well to ImageNet problems.

Table 2 shows that the cells discovered on CIFAR-10 generalize well to ImageNet:

[Table 2: ImageNet results for NASNet models transferred from CIFAR-10.]

Importantly, the largest model achieves a new state-of-the-art performance for ImageNet (82.7%) based on single, non-ensembled predictions, surpassing previous best published result by ∼1.2% [8].

Importantly, the largest model sets a new ImageNet state of the art (82.7% top-1) with single, non-ensembled predictions, about 1.2% above the best published result [8].

Among the unpublished works, our model is on par with the best reported result of 82.7% [25], while having significantly fewer floating point operations.

Among unpublished work, it matches the best reported 82.7% [25] while using significantly fewer FLOPs.

Figure 5 shows a complete summary of our results in comparison with other published results.

Figure 5 summarizes the comparison with other published results.

[Figure 5: accuracy versus computational demand for NASNet and published models.]

Finally, we test how well the best convolutional cells may perform in a resource-constrained setting, e.g., mobile devices (Table 3).

Finally, the best cells are evaluated in a resource-constrained setting, e.g., mobile devices (Table 3).

[Table 3: ImageNet results in the mobile, resource-constrained setting.]

An architecture constructed from the best convolutional cells achieves superior predictive performance (74.0% accuracy) surpassing previous models but with comparable computational demand.

An architecture built from the best cells reaches 74.0% top-1 accuracy, surpassing previous mobile models at comparable computational cost.

In summary, we find that the learned convolutional cells are flexible across model scales achieving state-of-the-art performance across almost 2 orders of magnitude in computational budget.

In short, the learned cells scale flexibly, reaching state-of-the-art performance across almost two orders of magnitude of computational budget.

Sec 4.3. Improved features for object detection

To address this question, we plug in the family of NASNet-A networks pretrained on ImageNet into the Faster-RCNN object detection pipeline [47] using an opensource software platform [28].

The NASNet-A family pretrained on ImageNet is plugged into the Faster-RCNN object detection pipeline [47] using an open-source platform [28].

For the mobile-optimized network, our resulting system achieves a mAP of 29.6% – exceeding previous mobile-optimized networks that employ Faster-RCNN by over 5.0% (Table 4).

With the mobile-optimized network, the system reaches 29.6% mAP, over 5.0% above previous mobile-optimized networks using Faster-RCNN (Table 4).

[Table 4: COCO object detection results with Faster-RCNN.]

These results provide further evidence that NASNet provides superior, generic image features that may be transferred across other computer vision tasks.

This further shows that NASNet learns superior, generic image features that transfer across vision tasks.

Figure 10 and Figure 11 in Appendix C show four examples of object detection results produced by NASNet-A with the Faster-RCNN framework.

[Figures 10 and 11 (Appendix C): example detection results from NASNet-A with Faster-RCNN.]

Sec 4.4. Efficiency of architecture search methods

Though what search method to use is not the focus of the paper, an open question is how effective is the reinforcement learning search method.

The choice of search method is not the focus of the paper; an open question is how effective the reinforcement-learning search actually is.

Figure 6 shows the performance of reinforcement learning (RL) and random search (RS) as more model architectures are sampled.

Figure 6 shows RL versus random search as more architectures are sampled.

[Figure 6: CIFAR-10 performance of sampled models under RL versus random search.]

Note that the best model identified with RL is significantly better than the best model found by RS, by over 1% as measured on CIFAR-10.

The best model from RL is over 1% better on CIFAR-10 than the best from random search.

Conclusion

In this work, we demonstrate how to learn scalable, convolutional cells from data that transfer to multiple image classification tasks.

The paper demonstrates how to learn scalable convolutional cells from data that transfer across image classification tasks.

The learned architecture is quite flexible as it may be scaled in terms of computational cost and parameters to easily address a variety of problems.

The learned architecture is flexible: it can be scaled in computational cost and parameter count to fit a variety of problems.

In all cases, the accuracy of the resulting model exceeds all human-designed models – ranging from models designed for mobile applications to computationally-heavy models designed to achieve the most accurate results.

In every regime considered, from compute-constrained mobile models to large accuracy-oriented ones, the searched models surpass the human-designed ones.

The key insight in our approach is to design a search space that decouples the complexity of an architecture from the depth of a network.

The key insight is a search space that decouples architectural complexity from network depth (i.e., search a cell and stack it, instead of searching the whole network).

This resulting search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classifications across a range of data and computational scales.

The resulting search space makes it possible to find good architectures on a small dataset (CIFAR-10) and transfer them across a range of data sizes and computational scales.

Finally, we demonstrate that we can use the resulting learned architecture to perform ImageNet classification with reduced computational budgets that outperform streamlined architectures targeted to mobile and embedded platforms [24, 70].

Finally, under reduced computational budgets the learned architecture outperforms streamlined architectures targeted at mobile and embedded platforms [24, 70].

Summary

  • Simplified search space: inspired by ResNet-like designs, searching a full CNN is reduced to searching a cell and stacking it, which shrinks the search space and makes the result transferable (across dataset scales and across CV tasks).
  • Handling different resolutions: Normal Cells and Reduction Cells let the stacked network process inputs of any resolution.
  • The node structure inside each cell is partially fixed (every cell ends with a node that concatenates the unused hidden states).

Reference