[Paper Archaeology] The Seminal Federated Learning Paper: Communication-Efficient Learning of Deep Networks from Decentralized Data
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Apr. 2017, pp. 1273–1282.
Federated Learning
Characteristics
- unbalanced and non-IID data: data heterogeneity is the defining characteristic of FL.
- massively distributed: the number of clients exceeds the average number of examples per client.
- limited communication (client availability): clients may be offline or on slow/expensive connections.
Advantage: communication-efficient
Communication-efficient here does not mean that sending only parameter updates is cheaper than shipping the raw data or the whole network. It means that, compared with synchronous SGD (one pass over each client's local data per round before merging parameters, the then-SOTA approach carried over from datacenter training), the same target accuracy is reached in 10–100× fewer communication rounds.
our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model
On not transmitting local data, the authors emphasize privacy protection rather than communication savings.
Core algorithm: FedAvg
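Below is a minimal sketch of the FedAvg loop as the paper describes it: each round, sample a fraction C of the K clients, run E local epochs of minibatch SGD with batch size B, then average the returned weights in proportion to each client's data volume. The flat-NumPy-vector weight layout, `grad_fn(w, batch)`, and the layout of `client_data` are simplifying assumptions of mine; the paper does not ship code.

```python
import numpy as np

def client_update(w, data, E, B, lr, grad_fn):
    """ClientUpdate: E local epochs of minibatch SGD with batch size B."""
    w = w.copy()
    for _ in range(E):
        data = np.random.permutation(data)       # reshuffle each local epoch
        for i in range(0, len(data), B):
            w -= lr * grad_fn(w, data[i:i + B])  # one SGD step
    return w

def fed_avg(w0, client_data, rounds, C, E, B, lr, grad_fn):
    """Server loop: sample max(C*K, 1) clients per round, train each locally
    from the current global model, and average weights by n_k / n."""
    w, K = w0, len(client_data)
    m = max(int(C * K), 1)
    for _ in range(rounds):
        chosen = np.random.choice(K, size=m, replace=False)
        n = sum(len(client_data[k]) for k in chosen)
        w_next = np.zeros_like(w)
        for k in chosen:
            w_k = client_update(w, client_data[k], E, B, lr, grad_fn)
            w_next += (len(client_data[k]) / n) * w_k  # weight by n_k / n
        w = w_next
    return w
```

Note that FedSGD is the special case E = 1 with B set to the full local dataset, which is why FedAvg's extra local computation trades compute for communication rounds.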
Notable insights
- Each update is specific to the current model, so there is little reason to exploit correlations between two consecutive updates:
  Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.
- More clients per round is not always better; there is a performance/communication trade-off:
  We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point.
- FedAvg is strikingly robust; the authors conjecture this is because averaging provides a dropout-like regularization effect:
  averaging provides any advantage (vs. actually diverging) when we naively average the parameters of models trained on entirely different pairs of digits. Thus, we view this as strong evidence for the robustness of this approach
  We conjecture that in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout [36]
- As long as the batch size B is matched to the hardware, lowering it does not meaningfully increase computation time:
  As long as B is large enough to take full advantage of available parallelism on the client hardware, there is essentially no cost in computation time for lowering it, and so in practice this should be the first parameter tuned.
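To make the E/B trade-off concrete, a quick check under the paper's accounting of per-round local computation, \(u = E\,n/(KB)\) (E local epochs over the roughly n/K examples a client holds, in batches of B), for a few configurations that appear in the results below:

```python
# Expected local updates per client per round, u = E * (n / K) / B.
# MNIST: n = 60,000 examples over K = 100 clients, i.e. 600 per client.
def updates_per_round(E, B, n=60_000, K=100):
    return E * (n // K) // B

print(updates_per_round(E=1, B=600))   # 1: the FedSGD baseline
print(updates_per_round(E=5, B=50))    # 60
print(updates_per_round(E=20, B=10))   # 1200: the "1200 local updates"
                                       # MNIST CNN configuration quoted below
```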
Performance gains
- Applies across multiple model architectures and scales:
  - a 2-hidden-layer NN (~200K parameters) and a CNN (~1.66M parameters)
- MNIST: 100 clients; in the pathological non-IID split each client holds at most 2 digit classes (a sketch of this split follows the list). The CNN reaches 99% accuracy in 97 rounds, 10× fewer than FedSGD; the NN reaches 97% in 380 rounds. Under IID data, with 1200 local updates per round the CNN needs only 18 rounds to reach 99%, a 35× reduction in communication.
- CIFAR-10: 100 clients; 80% accuracy in 280 rounds, a 64× speedup.
- Shakespeare: 1146 clients; reaching 54% accuracy is 95× faster under non-IID data.
- Large-scale LSTM: 500K clients built from 10 million public posts, with 200 clients updating per round; reaches 10.5% accuracy 23× faster.
- Local training uses batch size B of 10 or 50, local epochs E of 5 or 20, and client fraction C = 0.1.
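For reference, a sketch of the pathological non-IID MNIST split the paper describes: sort the 60,000 training examples by digit label, cut them into 200 shards of 300 examples, and deal 2 shards to each of 100 clients, so most clients see only 2 digits. The random labels below are a stand-in for MNIST's, just to keep the snippet runnable.

```python
import numpy as np

def pathological_non_iid_split(labels, n_clients=100, shards_per_client=2,
                               seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                # example indices sorted by label
    shards = np.array_split(order, n_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))  # deal the shards out randomly
    k = shards_per_client
    return [np.concatenate([shards[s] for s in shard_ids[c * k:(c + 1) * k]])
            for c in range(n_clients)]

labels = np.random.randint(0, 10, size=60_000)        # stand-in MNIST labels
client_indices = pathological_non_iid_split(labels)   # indices per client
```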
Open questions left for future work
- The paper targets training on mobile devices, so combining FL with communication research is a natural next step:
  the identification of the problem of training on decentralized data from mobile devices as an important research direction
- Scheduling under unstable communication conditions
- Game-theoretic angles that account for communication pricing
- Effects of bit-error rate and transmission rate on training
- Heterogeneous data
  - What happens when clients start from different data distributions (each client optimizes a different loss \(F_k\))?
    \(F_k\) could be an arbitrarily bad approximation to \(f\)
    \[f(w)=\sum_{k=1}^{K} \frac{n_{k}}{n} F_{k}(w) \quad \text{where} \quad F_{k}(w)=\frac{1}{n_{k}} \sum_{i \in \mathcal{P}_{k}} f_{i}(w)\]
  - What happens when data is added or removed during training?
  - What happens when clients come online at different times of day?
  - Under an unbalanced distribution, small local datasets overfit badly; does that really do no harm?
網路參數傳輸
-
部分網路傳輸
-
one-shot averaging(多半是正則的)訓練完後直接合併
- What exactly is the relationship between local overfitting and divergence? (a toy decay schedule follows this list)
  - The Shakespeare LSTM diverges badly once it overfits, while the MNIST CNN does not (though more local computation still makes divergence more likely)
  - For the large-scale LSTM, training converges faster with E = 1 than with E = 5
    This result suggests that for some models, especially in the later stages of convergence, it may be useful to decay the amount of local computation per round (moving to smaller E or larger B) in the same way decaying learning rates can be useful.
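A toy illustration of the E-decay idea in the quote above; the starting E and the 200-round halving interval are invented example values, not from the paper.

```python
# Decay local computation as training converges, analogous to a learning-rate
# schedule: halve the number of local epochs every `halve_every` rounds.
def local_epochs(t, E0=5, halve_every=200):
    """Local epochs E to use at round t."""
    return max(1, E0 // 2 ** (t // halve_every))

print([local_epochs(t) for t in (0, 200, 400, 600)])  # [5, 2, 1, 1]
```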
Assessment
Value of the paper
Novelty 100 × effectiveness 1000 × size of the research problem 100
Why FL could emerge
When two models start from the same parameter initialization, simply averaging their parameters after training (even after overfitting) improves performance! Compared with distributed SGD, which uploads after every single pass over the local data, this cuts the number of communication rounds dramatically.
This finding was made under IID data; simulations show significant gains under non-IID data as well, though less pronounced than under IID, which may itself be a gap worth digging into.
Recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized NNs are surprisingly well-behaved and in particular less prone to bad local minima than previously thought [11, 17, 9].
we find that naive parameter averaging works surprisingly well
the average of these two models, \(\frac{1}{2}w+ \frac{1}{2}w^\prime\), achieves significantly lower loss on the full MNIST training set than the best model achieved by training on either of the small datasets independently.
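A sketch of the experiment behind this observation (the paper's Figure 1): train two copies of one model from the SAME random initialization on disjoint shards, then scan the loss of the interpolated weights \(\theta w + (1-\theta)w'\). `train(w, data)` and `loss_on(w, data)` are hypothetical helpers standing in for whatever framework is used; weights are assumed to be flat NumPy vectors.

```python
import numpy as np

def interpolation_curve(w_init, data_a, data_b, full_data, train, loss_on):
    w = train(w_init.copy(), data_a)         # model trained on shard A
    w_prime = train(w_init.copy(), data_b)   # model trained on shard B
    # With a shared initialization, the paper finds theta = 0.5 achieves
    # lower loss on the full training set than either endpoint.
    return [(t, loss_on(t * w + (1 - t) * w_prime, full_data))
            for t in np.linspace(0.0, 1.0, 11)]
```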
Why FL became so popular
- The tide of the times: massive user bases, ever-stronger device compute, growing emphasis on privacy, and real application value
- A framework simple enough to be easy to follow, amplified by the Matthew effect
Hints and takeaways
- Using proxy data on the server side is standard practice (even though FL does not require it), but it still differs from clients' real data
- Hyperparameters were first tuned with many (2000) individual training runs on proxy data
- next-word prediction is the ideal FL application: it exhibits FL's three characteristics of realistic data, privacy sensitivity, and requiring no extra labels
- A work is not unoriginal merely because it is a direct extension of another work. Plain direct extensions usually either fail in practice or run against the intuition of their time; if changing one simple setting yields a significant performance gain, that is unquestionably a major innovation.