Zero copy in Netty and NIO

October 4, 2019
Notes

First, here is how Wikipedia introduces zero-copy: Zero-copy versions of operating system elements, such as device drivers, file systems, and network protocol stacks, greatly increase the performance of certain application programs and more efficiently utilize system resources. Performance is enhanced by allowing the CPU to move on to other tasks while data copies proceed in parallel in another part of the machine. Also, zero-copy operations reduce the number of time-consuming mode switches between user space and kernel space. System resources are utilized more efficiently since using a sophisticated CPU to perform extensive copy operations, which is a relatively simple task, is wasteful if other simpler system components can do the copying.

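Introduction

As Wikipedia explains, reading a file and sending it over the network the traditional way requires two data copies and two context switches for every read or write call, and one of those data copies is performed by the CPU. Sending the same file with zero-copy reduces the context switches to two and eliminates all CPU data copies. The original text reads: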

As an example, reading a file and then sending it over a network the traditional way requires two data copies and two context switches per read/write cycle. One of those data copies uses the CPU. Sending the same file via zero copy reduces the context switches to two and eliminates all CPU data copies.

Traditional I/O

Traditional I/O uses two system calls, read and write, to read the data and then send it.

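First copy, DMA copy: hard drive -> kernel buffer. The read call switches from user space to kernel space (the first context switch of read), and the DMA engine reads the file data from disk into a kernel-space buffer.
Second copy, CPU copy: kernel buffer -> user buffer. The data is copied from the kernel-space buffer into the user-space buffer. The return of the system call switches from kernel space back to user space (the second context switch of read).
Third copy, CPU copy: user buffer -> socket buffer. The write call switches from user space to kernel space (the first context switch of write), and the data is copied from the user-space buffer into the kernel buffer associated with the socket.
Fourth copy, DMA copy: socket buffer -> protocol engine. The DMA engine passes the data from the kernel buffer to the protocol engine, and the return of write switches from kernel space back to user space (the second context switch of write).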

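As you can see, each read or write involves two data copies and two context switches, so a full read/write cycle costs four copies and four context switches.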
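The read/write cycle described above can be sketched in Java with plain streams. This is a minimal illustration, not a benchmark: the class name TraditionalCopy is made up for this note, and a temporary file stands in for the socket so the example runs on its own. Every pass through the loop moves the data from the kernel into the user-space array buf and then back into the kernel.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: copies data through a user-space buffer.
final class TraditionalCopy {

    // Copies everything from in to out and returns the number of bytes moved.
    // read() fills buf from the kernel; write() hands buf back to the kernel.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("traditional-src", ".bin");
        Path dst = Files.createTempFile("traditional-dst", ".bin");
        Files.write(src, new byte[20_000]);
        // A file is used as the destination here; a socket's OutputStream would work the same way.
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            System.out.println("copied bytes: " + copy(in, out));
        }
        Files.delete(src);
        Files.delete(dst);
    }
}
```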

sendfile, introduced in the Linux 2.1 kernel series (the man page lists Linux 2.2)

sendfile reads and sends the data with a single system call.

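The first copy (DMA) goes from disk to the kernel buffer, with the first context switch;
the second copy (CPU) goes from the kernel buffer to the socket buffer;
the third copy (DMA) goes from the socket buffer to the protocol engine, with the second context switch.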

That is three data copies and two context switches in total, a big improvement over the traditional approach.

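The Java NIO package exposes an API for this zero-copy mechanism: the FileChannel.transferTo() method. FileChannel is an abstract class and transferTo() is an abstract method, however, so the actual behavior depends on the concrete implementation.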

public abstract long transferTo(long position, long count, WritableByteChannel target) throws IOException;

The underlying call on Linux:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
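A minimal sketch of how transferTo() might be called follows. The class and method names (TransferToDemo, sendFile) are made up for this note, and a second FileChannel stands in for the SocketChannel you would normally send to. Whether the JVM actually ends up in sendfile depends on the platform and on the type of the target channel, so treat this as an illustration of the API rather than a guarantee of zero copy.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class: sends a file to a channel with FileChannel.transferTo().
final class TransferToDemo {

    // Transfers the whole file to target and returns the number of bytes sent.
    // transferTo() may move fewer bytes than requested, hence the loop.
    // Assumes a blocking target; a non-blocking one could return 0 and needs extra handling.
    static long sendFile(Path src, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                position += in.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("transfer-src", ".bin");
        Path dst = Files.createTempFile("transfer-dst", ".bin");
        Files.write(src, new byte[100_000]);
        try (FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            System.out.println("transferred bytes: " + sendFile(src, out));
        }
        Files.delete(src);
        Files.delete(dst);
    }
}
```

Note that no byte[] or ByteBuffer appears in sendFile: the application never sees the data, which is the point of the API.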

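sendfile in Linux 2.4

Starting with Linux 2.4, the operating system supports DMA with scatter/gather for moving data from the kernel-space buffer to the protocol engine. The data to be sent can then be scattered across different memory locations and no longer has to sit in contiguous storage. As a result, the data read from the file does not need to be copied into the socket buffer at all. Only buffer descriptors are appended to the socket buffer, and the DMA gather operation uses the information in those descriptors to copy the data from kernel space directly to the protocol engine.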

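The first copy (DMA) goes from disk to the kernel buffer, with one context switch.
The second copy (DMA gather) goes from the kernel buffer directly to the protocol engine: only descriptors holding the location and offset of the data are appended to the socket buffer, and the DMA gather copy follows them to move the data out of the kernel buffer. The return of the call is the second context switch.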

Note: here every CPU copy has been eliminated, and the context switches are down to two.

mmap

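mmap (memory mapping) is more expensive than sendfile but still better than traditional I/O.

Compared with the traditional I/O above, it saves one CPU copy: the kernel buffer is mapped into the user address space, so the copy from the kernel buffer to the user buffer disappears. Sending the data still takes a separate write call, so the number of context switches stays at four.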

NIO provides MappedByteBuffer to support mmap. Like the commonly used DirectByteBuffer, it is allocated in off-heap memory. HeapByteBuffer, by contrast, is allocated on the heap.
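As a small illustration, a file can be mapped with FileChannel.map() and read without an explicit read call. The class and method names (MmapDemo, readMapped) are made up for this note, and the sketch assumes a small UTF-8 text file so that the result is easy to print.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class: reads a file through a memory mapping.
final class MmapDemo {

    // Maps the whole file read-only and decodes it as UTF-8 straight from the mapped region.
    static String readMapped(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return StandardCharsets.UTF_8.decode(mapped).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-demo", ".txt");
        file.toFile().deleteOnExit(); // a mapped file may not be deletable right away on some platforms
        Files.write(file, "zero copy".getBytes(StandardCharsets.UTF_8));
        System.out.println(readMapped(file));
    }
}
```

The mapping stays valid after the channel is closed; it is released only when the buffer is garbage collected, which is one reason mmap is the more expensive option for short-lived transfers.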

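Applications

In Kafka, the corresponding method of PlaintextTransportLayer calls FileChannel.transferTo() directly;
in Spark, taking BypassMergeSortShuffleWriter as an example, the call ends up in the copyFileStreamNIO() method of the general-purpose Utils class.
Netty's file transfer calls the transferTo method wrapped by FileRegion, which sends the data in the file buffer straight to the target Channel and avoids the memory copies caused by writing in a loop. Under the hood, FileRegion calls the transferTo function of NIO's FileChannel.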

Other zero copy in Netty

Zero copy with CompositeByteBuf

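CompositeByteBuf combines the ByteBufs that need to be merged and exposes a single readerIndex and writerIndex for them. Inside the CompositeByteBuf, however, the merged ByteBufs still exist separately; it is a single buffer only logically. CompositeByteBuf keeps the aggregated ByteBufs in a Component array, and the default maximum number of components (maxNumComponents) is 16.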
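The idea can be sketched without Netty. The toy class below is not Netty's implementation: CompositeView is a made-up name, it uses the JDK's ByteBuffer instead of ByteBuf so the example runs without extra dependencies, and it supports only indexed reads. It shows the core trick: the components are kept as views, and a logical index is translated to a component plus an offset.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class: presents several buffers as one logical buffer without copying them.
final class CompositeView {
    private final List<ByteBuffer> components = new ArrayList<>();
    private int length;

    // Adds the remaining bytes of buf as a component. slice() creates a view, not a copy.
    CompositeView add(ByteBuffer buf) {
        ByteBuffer view = buf.slice();
        components.add(view);
        length += view.remaining();
        return this;
    }

    int length() {
        return length;
    }

    // Reads the byte at a logical index by finding the component that holds it.
    byte get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        int offset = index;
        for (ByteBuffer c : components) {
            if (offset < c.remaining()) {
                return c.get(offset); // absolute get, does not move the position
            }
            offset -= c.remaining();
        }
        throw new AssertionError("unreachable");
    }

    public static void main(String[] args) {
        byte[] header = {1, 2};
        byte[] body = {3, 4, 5};
        CompositeView msg = new CompositeView()
                .add(ByteBuffer.wrap(header))
                .add(ByteBuffer.wrap(body));
        body[0] = 42; // visible through the composite because nothing was copied
        System.out.println(msg.length() + " " + msg.get(2)); // prints "5 42"
    }
}
```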

Zero copy with wrap

The Unpooled.wrappedBuffer method wraps bytes into an UnpooledHeapByteBuf object without any copying: the resulting ByteBuf shares the same storage as the bytes array, so a change to bytes is also a change to the ByteBuf.
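The JDK has a close analog in ByteBuffer.wrap, which makes the shared storage easy to observe without pulling in Netty. The sketch below uses that JDK call, not Unpooled.wrappedBuffer, and the names WrapDemo and view are made up for this note.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: wraps an array in a buffer without copying it.
final class WrapDemo {

    // Returns a buffer backed by the given array; no bytes are copied.
    static ByteBuffer view(byte[] bytes) {
        return ByteBuffer.wrap(bytes);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3};
        ByteBuffer buf = view(bytes);
        bytes[0] = 9;          // a change to the array shows up in the buffer
        buf.put(1, (byte) 8);  // a change to the buffer shows up in the array
        System.out.println(buf.get(0) + " " + bytes[1] + " " + (buf.array() == bytes)); // prints "9 8 true"
    }
}
```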

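Zero copy with slice

Splitting a ByteBuf into, say, a header and a body with the slice method involves no copying: internally, header and body simply share different parts of the storage of the original byteBuf.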
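The same behavior can be tried with the JDK's ByteBuffer.slice, used here as a stand-in for Netty's ByteBuf.slice so the example needs no extra dependencies. The names SliceDemo and sliceOf are made up for this note, and the 2-byte header and 3-byte body are arbitrary.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: splits one buffer into header and body views without copying.
final class SliceDemo {

    // Returns a view of length bytes starting at offset; it shares storage with buf.
    static ByteBuffer sliceOf(ByteBuffer buf, int offset, int length) {
        ByteBuffer dup = buf.duplicate(); // own position and limit, same content
        dup.limit(offset + length);
        dup.position(offset);
        return dup.slice();
    }

    public static void main(String[] args) {
        ByteBuffer message = ByteBuffer.wrap(new byte[] {10, 11, 20, 21, 22});
        ByteBuffer header = sliceOf(message, 0, 2);
        ByteBuffer body = sliceOf(message, 2, 3);
        body.put(0, (byte) 99); // writes through to the original message
        System.out.println(header.remaining() + " " + body.remaining() + " " + message.get(2)); // prints "2 3 99"
    }
}
```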

References

https://en.wikipedia.org/wiki/Zero-copy
https://www.jianshu.com/p/e76e3580e356
https://www.jianshu.com/p/193cae9cbf07
https://www.zhihu.com/question/57374068
https://www.jianshu.com/p/e488c8ee5b57