Source Code Analysis of Disk Flushing in the RocketMQ Broker

  • October 3, 2019
  • Notes

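The end of the previous post, 【Source Code Analysis of Message Storage in the RocketMQ Broker】, briefly touched on CommitLog flushing (this post builds closely on that one).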

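The Broker flushes the CommitLog with a background thread that keeps writing buffered content to disk (the CommitLog files). There are two modes: asynchronous flush and synchronous flush.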

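Asynchronous flush itself comes in two flavors:
① buffer in mappedByteBuffer -> write to disk (synchronous flush also takes this path)
② buffer in writeBuffer -> commit to fileChannel -> write to disk (this is the case covered in the previous post, where the off-heap write buffer, transientStorePoolEnable, is turned on)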

 

The two CommitLog flush modes:

```java
public enum FlushDiskType {
    SYNC_FLUSH,
    ASYNC_FLUSH
}
```

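There is a synchronous mode and an asynchronous mode. GroupCommitService implements synchronous flush and FlushRealTimeService implements asynchronous flush. Asynchronous flush is the default.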

In asynchronous mode with the off-heap write buffer (TransientStorePool) enabled, the Broker also starts CommitRealTimeService alongside FlushRealTimeService.
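The three combinations can be summarized in a small function. FlushServicePlan is a hypothetical helper, not a RocketMQ class. It only returns the names of the services this post says are active for a given FlushDiskType and transientStorePoolEnable setting. It assumes the off-heap pool is ignored in synchronous mode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names the CommitLog flush threads described in this post
// for a given configuration. Not a RocketMQ API.
class FlushServicePlan {
    enum FlushDiskType { SYNC_FLUSH, ASYNC_FLUSH }

    static List<String> servicesFor(FlushDiskType type, boolean transientStorePoolEnable) {
        List<String> services = new ArrayList<>();
        if (type == FlushDiskType.SYNC_FLUSH) {
            // Synchronous flush: GroupCommitService serves the flush requests
            // (assumed here: the off-heap pool flag has no effect in this mode).
            services.add("GroupCommitService");
        } else {
            // Asynchronous flush: FlushRealTimeService always runs...
            services.add("FlushRealTimeService");
            // ...and CommitRealTimeService is added when writeBuffer pooling is on.
            if (transientStorePoolEnable) {
                services.add("CommitRealTimeService");
            }
        }
        return services;
    }
}
```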

 

Synchronous flush:

The GroupCommitService thread is started:

```java
public void run() {
    CommitLog.log.info(this.getServiceName() + " service started");

    while (!this.isStopped()) {
        try {
            this.waitForRunning(10);
            this.doCommit();
        } catch (Exception e) {
            CommitLog.log.warn(this.getServiceName() + " service has exception. ", e);
        }
    }

    // Under normal circumstances shutdown, wait for the arrival of the
    // request, and then flush
    try {
        Thread.sleep(10);
    } catch (InterruptedException e) {
        CommitLog.log.warn("GroupCommitService Exception, ", e);
    }

    synchronized (this) {
        this.swapRequests();
    }

    this.doCommit();

    CommitLog.log.info(this.getServiceName() + " service end");
}
```

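Inside the loop, the thread calls doCommit repeatedly to flush. On shutdown it sleeps 10 ms so that in-flight requests can arrive. It then swaps the request lists one last time and runs a final doCommit.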

The doCommit method:

```java
private void doCommit() {
    synchronized (this.requestsRead) {
        if (!this.requestsRead.isEmpty()) {
            for (GroupCommitRequest req : this.requestsRead) {
                // There may be a message in the next file, so a maximum of
                // two times the flush
                boolean flushOK = false;
                for (int i = 0; i < 2 && !flushOK; i++) {
                    flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();

                    if (!flushOK) {
                        CommitLog.this.mappedFileQueue.flush(0);
                    }
                }

                req.wakeupCustomer(flushOK);
            }

            long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
            if (storeTimestamp > 0) {
                CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
            }

            this.requestsRead.clear();
        } else {
            // Because of individual messages is set to not sync flush, it
            // will come to this process
            CommitLog.this.mappedFileQueue.flush(0);
        }
    }
}
```

GroupCommitService maintains two Lists:

```java
private volatile List<GroupCommitRequest> requestsWrite = new ArrayList<GroupCommitRequest>();
private volatile List<GroupCommitRequest> requestsRead = new ArrayList<GroupCommitRequest>();
```

GroupCommitRequest wraps an offset:

```java
private final long nextOffset;
```

 

This is where the handleDiskFlush method from the end of the previous post comes in:

```java
public void handleDiskFlush(AppendMessageResult result, PutMessageResult putMessageResult, MessageExt messageExt) {
    // Synchronization flush
    if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
        final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
        if (messageExt.isWaitStoreMsgOK()) {
            GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
            service.putRequest(request);
            boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
            if (!flushOK) {
                log.error("do groupcommit, wait for flush failed, topic: " + messageExt.getTopic() + " tags: " + messageExt.getTags()
                    + " client address: " + messageExt.getBornHostString());
                putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_DISK_TIMEOUT);
            }
        } else {
            service.wakeup();
        }
    }
    // Asynchronous flush
    else {
        if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            flushCommitLogService.wakeup();
        } else {
            commitLogService.wakeup();
        }
    }
}
```

This method is called after the Broker has received a message from the Producer and finished writing it into the ByteBuffer.

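In SYNC_FLUSH mode, if the message waits for the store result (isWaitStoreMsgOK), handleDiskFlush computes nextOffset from the WroteOffset and WroteBytes in AppendMessageResult. It wraps nextOffset in a GroupCommitRequest and passes it to GroupCommitService's putRequest method, which adds it to the requestsWrite List.
The putRequest method: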

```java
public synchronized void putRequest(final GroupCommitRequest request) {
    synchronized (this.requestsWrite) {
        this.requestsWrite.add(request);
    }
    if (hasNotified.compareAndSet(false, true)) {
        waitPoint.countDown(); // notify
    }
}
```

After the add, putRequest uses a CAS to flip the atomic boolean hasNotified and calls countDown on waitPoint to wake the flush thread. This matters later.

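Because this is synchronous flush, the caller then calls GroupCommitRequest's waitForFlush method. It waits up to the sync-flush timeout for the flush that covers this record, and if the wait fails the status is set to FLUSH_DISK_TIMEOUT.
Asynchronous flush only calls wakeup to wake the flush task and does not wait. The same holds in synchronous mode for messages that do not wait for the store result. That is the difference between the two.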
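The wait-and-wake handshake between handleDiskFlush and the flush thread can be sketched with a CountDownLatch. SketchGroupCommitRequest is hypothetical. It keeps the method names used above (getNextOffset, wakeupCustomer, waitForFlush), but its latch-based body is an assumption, not a copy of the RocketMQ class.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a flush request that a writer thread can block on.
class SketchGroupCommitRequest {
    private final long nextOffset;                        // wroteOffset + wroteBytes
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile boolean flushOK = false;

    SketchGroupCommitRequest(long nextOffset) {
        this.nextOffset = nextOffset;
    }

    long getNextOffset() {
        return nextOffset;
    }

    // Flush thread: record the outcome, then release the waiting writer.
    void wakeupCustomer(boolean flushOK) {
        this.flushOK = flushOK;
        latch.countDown();
    }

    // Writer thread: wait up to timeoutMillis; false means timeout or failed flush.
    boolean waitForFlush(long timeoutMillis) {
        try {
            latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return flushOK;
    }
}
```

If the latch times out, flushOK is still false. In handleDiskFlush above, that result leads to FLUSH_DISK_TIMEOUT.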

Back in doCommit, notice that it works on the requestsRead List, while the request was just added to requestsWrite.
The waitForRunning call in the run method explains this:

```java
protected void waitForRunning(long interval) {
    if (hasNotified.compareAndSet(true, false)) {
        this.onWaitEnd();
        return;
    }

    //entry to wait
    waitPoint.reset();

    try {
        waitPoint.await(interval, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        log.error("Interrupted", e);
    } finally {
        hasNotified.set(false);
        this.onWaitEnd();
    }
}
```

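waitForRunning first tries a CAS on hasNotified from true to false. If the CAS succeeds, a notification is already pending, so it calls onWaitEnd and returns at once. If the CAS fails, it resets waitPoint and blocks in await until putRequest (above) wakes it. In either case onWaitEnd runs before the method returns. So after a Producer's message has been buffered and handleDiskFlush has run, the flush thread wakes up and starts working. It also wakes on its own once the interval timeout expires.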
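The hasNotified/waitPoint pattern can be sketched with plain JDK primitives. RocketMQ's ServiceThread uses its own resettable latch for waitPoint. The hypothetical MiniServiceThread below uses an AtomicBoolean and an object monitor instead. It re-checks the flag while holding the lock, so a wakeup that races with the wait is not lost.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the notify/wait protocol; not RocketMQ's ServiceThread.
class MiniServiceThread {
    private final AtomicBoolean hasNotified = new AtomicBoolean(false);
    private final Object lock = new Object();
    final AtomicInteger waitEndCount = new AtomicInteger();

    // Producer side (cf. putRequest / wakeup): set the flag once, then notify.
    void wakeup() {
        if (hasNotified.compareAndSet(false, true)) {
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }

    // Flush-thread side: true if a notification was consumed, false on timeout.
    boolean waitForRunning(long intervalMillis) {
        // Fast path: a notification arrived while the thread was busy flushing.
        if (hasNotified.compareAndSet(true, false)) {
            onWaitEnd();
            return true;
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        synchronized (lock) {
            // Checked under the lock, so wakeup() cannot notify between check and wait.
            while (!hasNotified.get()) {
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remaining <= 0) {
                    break;
                }
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        boolean notified = hasNotified.getAndSet(false);
        onWaitEnd();  // runs on both the notified and the timeout path
        return notified;
    }

    // GroupCommitService overrides this hook to swap its request lists.
    protected void onWaitEnd() {
        waitEndCount.incrementAndGet();
    }
}
```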

Next, the onWaitEnd method:

```java
protected void onWaitEnd() {
    this.swapRequests();
}

private void swapRequests() {
    List<GroupCommitRequest> tmp = this.requestsWrite;
    this.requestsWrite = this.requestsRead;
    this.requestsRead = tmp;
}
```

It simply swaps the two Lists.

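This is a neat trick. If you know the JVM, it resembles the young generation's copying collector, where the two survivor spaces swap roles.
While the flush thread is blocked, requests pile up in requestsWrite. When the thread wakes up, it first swaps requestsWrite and requestsRead, so it now reads those requests from requestsRead. Meanwhile requestsWrite has become an empty List that collects new requests. The cycle then repeats.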

When requestsRead is not empty, doCommit calls clear on it at the end, which confirms this reading.
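The swap can be isolated in a small double-buffer class. RequestDoubleBuffer is a hypothetical sketch, not RocketMQ code. For simplicity it takes one lock in put and swap. RocketMQ instead synchronizes on the list objects themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the requestsWrite / requestsRead double buffer.
class RequestDoubleBuffer<T> {
    private List<T> requestsWrite = new ArrayList<>();
    private List<T> requestsRead = new ArrayList<>();

    // Producer threads append to the write list (cf. putRequest).
    synchronized void put(T request) {
        requestsWrite.add(request);
    }

    // Flush thread: exchange the lists (cf. swapRequests); producers then keep
    // appending to the now-empty list while the flush thread owns the other one.
    synchronized void swap() {
        List<T> tmp = requestsWrite;
        requestsWrite = requestsRead;
        requestsRead = tmp;
    }

    // Flush thread only: handle every request in the read list, then clear it
    // (cf. doCommit). Returns how many requests were handled.
    int processRead(Consumer<T> handler) {
        int n = requestsRead.size();
        for (T req : requestsRead) {
            handler.accept(req);
        }
        requestsRead.clear();
        return n;
    }
}
```

A request that arrives after a swap waits for the next swap. Producers never contend with the iteration in processRead.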

 

Now a closer look at how the flush itself happens:

```java
for (GroupCommitRequest req : this.requestsRead) {
    // There may be a message in the next file, so a maximum of
    // two times the flush
    boolean flushOK = false;
    for (int i = 0; i < 2 && !flushOK; i++) {
        flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();

        if (!flushOK) {
            CommitLog.this.mappedFileQueue.flush(0);
        }
    }

    req.wakeupCustomer(flushOK);
}
```

Iterating over requestsRead yields the nextOffset stored in each GroupCommitRequest.

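flushedWhere records the offset reached by the last completed flush. If flushedWhere is already at or beyond nextOffset, everything up to nextOffset is on disk and no flush is needed. Otherwise mappedFileQueue's flush method is called. The loop runs at most twice because one flush call covers only a single MappedFile. When the request's data sits in the next file, a second flush is needed.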
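A toy model shows why a second attempt is needed. FlushSimulation is hypothetical. Each flushOnce call only advances flushedWhere to the end of the file that currently contains it, just as MappedFileQueue.flush handles a single MappedFile. One deliberate difference: this sketch re-checks flushedWhere after every flush. As I read the loop above, flushOK there is computed before each flush, so the value handed to wakeupCustomer does not reflect the outcome of a second flush.

```java
// Hypothetical toy model (not RocketMQ code): a single flushOnce() call never
// goes past the end of the file that holds flushedWhere.
class FlushSimulation {
    final long fileSize;   // size of every CommitLog file
    long wrotePosition;    // global offset written so far, across files
    long flushedWhere;     // global offset flushed so far

    FlushSimulation(long fileSize) {
        this.fileSize = fileSize;
    }

    // Flush the file containing flushedWhere, up to what has been written in it.
    void flushOnce() {
        long fileFromOffset = (flushedWhere / fileSize) * fileSize;
        flushedWhere = Math.min(wrotePosition, fileFromOffset + fileSize);
    }

    // At most two flush attempts per request, re-checking after each one.
    boolean flushUpTo(long nextOffset) {
        boolean flushOK = flushedWhere >= nextOffset;
        for (int i = 0; i < 2 && !flushOK; i++) {
            flushOnce();
            flushOK = flushedWhere >= nextOffset;
        }
        return flushOK;
    }
}
```

With 1000-byte files, 1500 bytes written and flushedWhere = 200, the first flush stops at 1000 (the file boundary) and the second reaches 1500.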

MappedFileQueue's flush method:

```java
public boolean flush(final int flushLeastPages) {
    boolean result = true;
    MappedFile mappedFile = this.findMappedFileByOffset(this.flushedWhere, this.flushedWhere == 0);
    if (mappedFile != null) {
        long tmpTimeStamp = mappedFile.getStoreTimestamp();
        int offset = mappedFile.flush(flushLeastPages);
        long where = mappedFile.getFileFromOffset() + offset;
        result = where == this.flushedWhere;
        this.flushedWhere = where;
        if (0 == flushLeastPages) {
            this.storeTimestamp = tmpTimeStamp;
        }
    }

    return result;
}
```

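The method starts from flushedWhere, the offset reached by the last flush, and calls findMappedFileByOffset to get the MappedFile that maps the corresponding CommitLog file. After the flush, flushedWhere becomes the file's start offset (getFileFromOffset) plus the position flushed inside that file. The method returns true only if flushedWhere did not move.
My earlier posts covered MappedFile and its operations many times, so I won't repeat that here.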

Once it has the MappedFile, it calls MappedFile's flush method:

```java
public int flush(final int flushLeastPages) {
    if (this.isAbleToFlush(flushLeastPages)) {
        if (this.hold()) {
            int value = getReadPosition();

            try {
                //We only append data to fileChannel or mappedByteBuffer, never both.
                if (writeBuffer != null || this.fileChannel.position() != 0) {
                    this.fileChannel.force(false);
                } else {
                    this.mappedByteBuffer.force();
                }
            } catch (Throwable e) {
                log.error("Error occurred when force data to disk.", e);
            }

            this.flushedPosition.set(value);
            this.release();
        } else {
            log.warn("in flush, hold failed, flush offset = " + this.flushedPosition.get());
            this.flushedPosition.set(getReadPosition());
        }
    }
    return this.getFlushedPosition();
}
```

First, the isAbleToFlush method:

```java
private boolean isAbleToFlush(final int flushLeastPages) {
    int flush = this.flushedPosition.get();
    int write = getReadPosition();

    if (this.isFull()) {
        return true;
    }

    if (flushLeastPages > 0) {
        return ((write / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE)) >= flushLeastPages;
    }

    return write > flush;
}
```

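flush is the position reached by the last completed flush. write is the current readable position: getReadPosition returns wrotePosition when there is no writeBuffer, and committedPosition otherwise.
When flushLeastPages is greater than 0, the expression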

```java
return ((write / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE)) >= flushLeastPages;
```

decides whether enough pages have accumulated. OS_PAGE_SIZE is 4 KB, so one page is 4 KB.

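In synchronous flush flushLeastPages is 0, so there is no page requirement and any buffered content is flushed. In asynchronous flush flushLeastPages defaults to 4. The buffer is then written to the file only once at least 4 pages (4 × 4 KB = 16 KB, counted as page boundaries crossed) have accumulated. The exceptions are a full file and the periodic thorough flush.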
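The page arithmetic is easy to check in isolation. PageCheck is a standalone copy of the isAbleToFlush logic. Its inputs are passed as arguments for testability, an assumption of this sketch; the real method reads its own fields.

```java
// Standalone copy of the isAbleToFlush check; positions are parameters here.
class PageCheck {
    static final int OS_PAGE_SIZE = 1024 * 4;   // 4 KB

    static boolean isAbleToFlush(int flush, int write, int flushLeastPages, boolean isFull) {
        if (isFull) {
            return true;                         // a full file is always flushed
        }
        if (flushLeastPages > 0) {
            // Compares page indexes, not byte counts.
            return ((write / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE)) >= flushLeastPages;
        }
        return write > flush;                    // flushLeastPages == 0: any new data
    }
}
```

With flush = 0, write = 16384 passes (4 pages) and write = 16383 fails (3 pages). With flush = 4000 and write = 16384 the check also passes, although only 12,384 new bytes exist, because page indexes 0 and 4 are four apart.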

 

Back in MappedFile's flush method, once isAbleToFlush has passed:

```java
int value = getReadPosition();
try {
    //We only append data to fileChannel or mappedByteBuffer, never both.
    if (writeBuffer != null || this.fileChannel.position() != 0) {
        this.fileChannel.force(false);
    } else {
        this.mappedByteBuffer.force();
    }
} catch (Throwable e) {
    log.error("Error occurred when force data to disk.", e);
}

this.flushedPosition.set(value);
```

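getReadPosition first returns the position just past the buffered messages. In synchronous mode writeBuffer is null, because the off-heap write buffer is only enabled for asynchronous flush. The flush therefore calls mappedByteBuffer's force method, a JDK NIO call that writes the mappedByteBuffer's data to the CommitLog file.
Finally flushedPosition is updated.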
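Both force calls are ordinary JDK NIO. The hypothetical ForceDemo below persists data both ways. The first is through a MappedByteBuffer and force(), which is path ①. The second is through FileChannel.write followed by force(false), the channel path used once a writeBuffer is involved. File names and sizes are arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demo of the two persistence paths MappedFile.flush chooses between (not RocketMQ code).
class ForceDemo {
    // Path 1: write through a memory mapping, then MappedByteBuffer.force().
    static void writeViaMmap(Path file, byte[] data, int fileSize) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE past the current size grows the file to fileSize.
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            mapped.put(data);   // bytes land in the page cache through the mapping
            mapped.force();     // ask the OS to write the mapped region to the device
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Path 2: FileChannel.write from a buffer, then FileChannel.force(false).
    static void writeViaChannel(Path file, byte[] data) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);  // copies into the page cache only
            }
            ch.force(false);    // force file content; false = metadata need not be forced
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helpers so callers need not handle checked IOExceptions.
    static Path newTempFile(String prefix) {
        try {
            Path p = Files.createTempFile(prefix, ".bin");
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] readAll(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```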

Back in MappedFileQueue's flush method, after MappedFile's flush completes, flushedWhere is updated too.

At this point the buffered data is persisted and the synchronous flush is complete.

 

Asynchronous flush:

① FlushRealTimeService (the flushCommitLogService used in asynchronous mode; FlushCommitLogService is its base class):

```java
public void run() {
    CommitLog.log.info(this.getServiceName() + " service started");

    while (!this.isStopped()) {
        boolean flushCommitLogTimed = CommitLog.this.defaultMessageStore.getMessageStoreConfig().isFlushCommitLogTimed();

        int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushIntervalCommitLog();
        int flushPhysicQueueLeastPages = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushCommitLogLeastPages();

        int flushPhysicQueueThoroughInterval =
            CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushCommitLogThoroughInterval();

        boolean printFlushProgress = false;

        // Print flush progress
        long currentTimeMillis = System.currentTimeMillis();
        if (currentTimeMillis >= (this.lastFlushTimestamp + flushPhysicQueueThoroughInterval)) {
            this.lastFlushTimestamp = currentTimeMillis;
            flushPhysicQueueLeastPages = 0;
            printFlushProgress = (printTimes++ % 10) == 0;
        }

        try {
            if (flushCommitLogTimed) {
                Thread.sleep(interval);
            } else {
                this.waitForRunning(interval);
            }

            if (printFlushProgress) {
                this.printFlushProgress();
            }

            long begin = System.currentTimeMillis();
            CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
            long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
            if (storeTimestamp > 0) {
                CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
            }
            long past = System.currentTimeMillis() - begin;
            if (past > 500) {
                log.info("Flush data to disk costs {} ms", past);
            }
        } catch (Throwable e) {
            CommitLog.log.warn(this.getServiceName() + " service has exception. ", e);
            this.printFlushProgress();
        }
    }

    // Normal shutdown, to ensure that all the flush before exit
    boolean result = false;
    for (int i = 0; i < RETRY_TIMES_OVER && !result; i++) {
        result = CommitLog.this.mappedFileQueue.flush(0);
        CommitLog.log.info(this.getServiceName() + " service shutdown, retry " + (i + 1) + " times " + (result ? "OK" : "Not OK"));
    }

    this.printFlushProgress();

    CommitLog.log.info(this.getServiceName() + " service end");
}
```

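flushCommitLogTimed: whether to flush on a fixed timer (sleep) instead of waiting to be woken
interval: flush interval, default 500 ms
flushPhysicQueueLeastPages: minimum number of pages per flush, default 4
flushPhysicQueueThoroughInterval: interval between thorough flushes, default 10 s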

The loop first adds flushPhysicQueueThoroughInterval to lastFlushTimestamp (the time of the last thorough flush) and compares the sum with the current time. If a thorough flush is due, flushPhysicQueueLeastPages is set to 0.

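It then branches on flushCommitLogTimed:
If it is true, the thread sleeps for interval (500 ms by default).
If it is false, the thread calls waitForRunning with interval (500 ms) as the timeout. It can be woken earlier by the wakeup call in handleDiskFlush, or by CommitRealTimeService.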

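Finally, as in synchronous flush, it calls mappedFileQueue's flush method.
The difference is that flushPhysicQueueLeastPages decides the kind of flush. With 0 it is a thorough flush; otherwise data is flushed only once 4 pages (16 KB) have accumulated.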
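The thorough-flush decision at the top of the loop can be pulled out into a small policy object. FlushIntervalPolicy is hypothetical. It keeps only the lastFlushTimestamp bookkeeping and returns the flushLeastPages value to use in the current round.

```java
// Hypothetical extraction of the thorough-flush check at the top of
// FlushRealTimeService.run(); not a RocketMQ class.
class FlushIntervalPolicy {
    private final int flushLeastPages;          // e.g. 4 pages
    private final long thoroughIntervalMillis;  // e.g. 10 000 ms
    private long lastFlushTimestamp = 0;        // time of the last thorough flush

    FlushIntervalPolicy(int flushLeastPages, long thoroughIntervalMillis) {
        this.flushLeastPages = flushLeastPages;
        this.thoroughIntervalMillis = thoroughIntervalMillis;
    }

    // Returns the flushLeastPages argument to pass to mappedFileQueue.flush() now.
    int leastPagesFor(long nowMillis) {
        if (nowMillis >= lastFlushTimestamp + thoroughIntervalMillis) {
            lastFlushTimestamp = nowMillis;  // start a new thorough-flush window
            return 0;                        // 0 = flush any dirty data
        }
        return flushLeastPages;              // otherwise require whole pages
    }
}
```

CommitRealTimeService (below) applies the same rule with its own parameters. It also resets its timestamp whenever a commit actually moved data.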

② CommitRealTimeService
This mode works together with FlushRealTimeService.

CommitRealTimeService's run method:

```java
public void run() {
    CommitLog.log.info(this.getServiceName() + " service started");
    while (!this.isStopped()) {
        int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitIntervalCommitLog();

        int commitDataLeastPages = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogLeastPages();

        int commitDataThoroughInterval =
            CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogThoroughInterval();

        long begin = System.currentTimeMillis();
        if (begin >= (this.lastCommitTimestamp + commitDataThoroughInterval)) {
            this.lastCommitTimestamp = begin;
            commitDataLeastPages = 0;
        }

        try {
            boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
            long end = System.currentTimeMillis();
            if (!result) {
                this.lastCommitTimestamp = end; // result = false means some data committed.
                //now wake up flush thread.
                flushCommitLogService.wakeup();
            }

            if (end - begin > 500) {
                log.info("Commit data to file costs {} ms", end - begin);
            }
            this.waitForRunning(interval);
        } catch (Throwable e) {
            CommitLog.log.error(this.getServiceName() + " service has exception. ", e);
        }
    }

    boolean result = false;
    for (int i = 0; i < RETRY_TIMES_OVER && !result; i++) {
        result = CommitLog.this.mappedFileQueue.commit(0);
        CommitLog.log.info(this.getServiceName() + " service shutdown, retry " + (i + 1) + " times " + (result ? "OK" : "Not OK"));
    }
    CommitLog.log.info(this.getServiceName() + " service end");
}
```

The logic resembles FlushRealTimeService, but the parameters differ slightly:

interval: commit interval, default 200 ms
commitDataLeastPages: minimum number of pages per commit, default 4
commitDataThoroughInterval: interval between thorough commits, default 200 ms

Otherwise it mirrors FlushRealTimeService, except that it calls mappedFileQueue's commit method:

```java
public boolean commit(final int commitLeastPages) {
    boolean result = true;
    MappedFile mappedFile = this.findMappedFileByOffset(this.committedWhere, this.committedWhere == 0);
    if (mappedFile != null) {
        int offset = mappedFile.commit(commitLeastPages);
        long where = mappedFile.getFileFromOffset() + offset;
        result = where == this.committedWhere;
        this.committedWhere = where;
    }

    return result;
}
```

This closely mirrors mappedFileQueue's flush method, except that it locates the MappedFile by committedWhere.

It then calls MappedFile's commit method:

```java
public int commit(final int commitLeastPages) {
    if (writeBuffer == null) {
        //no need to commit data to file channel, so just regard wrotePosition as committedPosition.
        return this.wrotePosition.get();
    }
    if (this.isAbleToCommit(commitLeastPages)) {
        if (this.hold()) {
            commit0(commitLeastPages);
            this.release();
        } else {
            log.warn("in commit, hold failed, commit offset = " + this.committedPosition.get());
        }
    }

    // All dirty data has been committed to FileChannel.
    if (writeBuffer != null && this.transientStorePool != null && this.fileSize == this.committedPosition.get()) {
        this.transientStorePool.returnBuffer(writeBuffer);
        this.writeBuffer = null;
    }

    return this.committedPosition.get();
}
```

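This again resembles MappedFile's flush method. If writeBuffer is null, there is nothing to commit and wrotePosition is returned directly. Otherwise isAbleToCommit checks the page count, and then commit0 is called.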

MappedFile's commit0 method:

```java
protected void commit0(final int commitLeastPages) {
    int writePos = this.wrotePosition.get();
    int lastCommittedPosition = this.committedPosition.get();

    if (writePos - this.committedPosition.get() > 0) {
        try {
            ByteBuffer byteBuffer = writeBuffer.slice();
            byteBuffer.position(lastCommittedPosition);
            byteBuffer.limit(writePos);
            this.fileChannel.position(lastCommittedPosition);
            this.fileChannel.write(byteBuffer);
            this.committedPosition.set(writePos);
        } catch (Throwable e) {
            log.error("Error occurred when commit data to FileChannel.", e);
        }
    }
}
```

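As explained in 【Source Code Analysis of Message Storage in the RocketMQ Broker】, in this mode messages are first buffered in writeBuffer rather than in the mappedByteBuffer used before.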
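The code writes the part of writeBuffer from lastCommittedPosition (the last commit position) to writePos (the end of the buffered messages) into fileChannel, at the same file position. That only hands the data to the file channel (the OS page cache); it is not forced to disk yet.
After the write to fileChannel, committedPosition is updated.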
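The commit0 step can be reproduced with JDK classes alone. CommitSketch is hypothetical. It takes the buffer, the two positions and the channel as parameters instead of reading MappedFile fields. It also uses a heap buffer where RocketMQ uses a pooled direct buffer. No force() is called, so the bytes only reach the page cache.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of commit0: copy [committedPosition, wrotePosition) of
// writeBuffer into the file at the same offset, without forcing it to disk.
class CommitSketch {
    // Returns the new committedPosition. writeBuffer's own position must be 0.
    static int commit0(ByteBuffer writeBuffer, int committedPosition, int wrotePosition,
                       FileChannel fileChannel) {
        if (wrotePosition - committedPosition <= 0) {
            return committedPosition;                 // nothing new since the last commit
        }
        try {
            ByteBuffer slice = writeBuffer.slice();   // shared content, own position/limit
            slice.position(committedPosition);
            slice.limit(wrotePosition);
            fileChannel.position(committedPosition);  // same offset in the file
            while (slice.hasRemaining()) {
                fileChannel.write(slice);
            }
            return wrotePosition;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage demo: "append" 8 bytes, commit them in two rounds, read the file back.
    static String commitTwiceDemo() {
        try {
            Path file = Files.createTempFile("commit0", ".bin");
            file.toFile().deleteOnExit();
            ByteBuffer writeBuffer = ByteBuffer.allocate(16);
            // Write via a duplicate so writeBuffer's own position stays at 0.
            writeBuffer.duplicate().put("ABCDEFGH".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                int committed = commit0(writeBuffer, 0, 4, ch);      // first 4 bytes
                committed = commit0(writeBuffer, committed, 8, ch);  // next 4 bytes
            }
            return new String(Files.readAllBytes(file), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```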

Back in commit: after writing to fileChannel, the method checks whether committedPosition has reached fileSize, that is, whether all of writeBuffer's content has been committed.

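If it has, writeBuffer is recycled through transientStorePool's returnBuffer method.
TransientStorePool is essentially a wrapper around a double-ended queue (Deque) of ByteBuffers, and it is owned by the message store (DefaultMessageStore):
TransientStorePool: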

```java
public class TransientStorePool {
    private static final InternalLogger log = InternalLoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);

    private final int poolSize;
    private final int fileSize;
    private final Deque<ByteBuffer> availableBuffers;
    private final MessageStoreConfig storeConfig;

    public TransientStorePool(final MessageStoreConfig storeConfig) {
        this.storeConfig = storeConfig;
        this.poolSize = storeConfig.getTransientStorePoolSize();
        this.fileSize = storeConfig.getMapedFileSizeCommitLog();
        this.availableBuffers = new ConcurrentLinkedDeque<>();
    }
    ......
}
```

The returnBuffer method:

```java
public void returnBuffer(ByteBuffer byteBuffer) {
    byteBuffer.position(0);
    byteBuffer.limit(fileSize);
    this.availableBuffers.offerFirst(byteBuffer);
}
```

The byteBuffer really is recycled: its position and limit are reset, and it is pushed back onto the head of the deque.
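The pool's life cycle can be sketched as follows. SketchTransientStorePool is hypothetical and keeps only the deque bookkeeping. Its constructor arguments, init, borrowBuffer and available are illustration names, not a claim about the real API. Only returnBuffer's body follows the code above.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical, simplified buffer pool modeled on the TransientStorePool excerpt.
class SketchTransientStorePool {
    private final int poolSize;
    private final int fileSize;
    private final Deque<ByteBuffer> availableBuffers = new ConcurrentLinkedDeque<>();

    SketchTransientStorePool(int poolSize, int fileSize) {
        this.poolSize = poolSize;
        this.fileSize = fileSize;
    }

    // Pre-allocate poolSize off-heap buffers, one per CommitLog file.
    void init() {
        for (int i = 0; i < poolSize; i++) {
            availableBuffers.offer(ByteBuffer.allocateDirect(fileSize));
        }
    }

    // Hand a buffer to a new file as its writeBuffer; null if the pool is empty.
    ByteBuffer borrowBuffer() {
        return availableBuffers.pollFirst();
    }

    // Same steps as returnBuffer above: reset position/limit, push to the head.
    void returnBuffer(ByteBuffer byteBuffer) {
        byteBuffer.position(0);
        byteBuffer.limit(fileSize);
        availableBuffers.offerFirst(byteBuffer);
    }

    int available() {
        return availableBuffers.size();
    }
}
```

In this sketch, returnBuffer uses offerFirst and borrowBuffer uses pollFirst, so the most recently returned buffer is the next one handed out (LIFO reuse).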

 

Back to MappedFileQueue's commit method:

```java
public boolean commit(final int commitLeastPages) {
    boolean result = true;
    MappedFile mappedFile = this.findMappedFileByOffset(this.committedWhere, this.committedWhere == 0);
    if (mappedFile != null) {
        int offset = mappedFile.commit(commitLeastPages);
        long where = mappedFile.getFileFromOffset() + offset;
        result = where == this.committedWhere;
        this.committedWhere = where;
    }

    return result;
}
```

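After mappedFile's commit, comparing where with committedWhere shows whether anything was actually written to fileChannel. result is false only when data really was committed.
committedWhere is then updated and result is returned.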

 

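Back in CommitRealTimeService's run method, the thread checks result after the commit.
Only when data has actually been written to fileChannel does it call flushCommitLogService's wakeup method, which wakes the FlushRealTimeService flush thread.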

The only difference from the FlushRealTimeService analysis above is in MappedFile's flush method:

```java
if (writeBuffer != null || this.fileChannel.position() != 0) {
    this.fileChannel.force(false);
} else {
    this.mappedByteBuffer.force();
}
```

Without the off-heap write buffer, the flush wrote the content of mappedByteBuffer to disk.
Now it is finally fileChannel's turn.

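The condition takes the fileChannel branch when writeBuffer is not null or when fileChannel's position is not 0.
writeBuffer becomes null once the file has been fully committed and TransientStorePool has reclaimed the buffer. The position check still sends that file's flush to fileChannel, because commit0 moved the channel's position when it wrote the data.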
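The branch can be restated as a pure function. ForceChoice is hypothetical. It just encodes the reasoning above, taking the writeBuffer state and the channel position as inputs.

```java
// Hypothetical restatement of the branch in MappedFile.flush (not RocketMQ code).
class ForceChoice {
    static String forceTarget(boolean hasWriteBuffer, long fileChannelPosition) {
        // writeBuffer != null: data reaches the file through commit0's FileChannel.write.
        // position != 0: writeBuffer was already returned to the pool, but commit0
        // moved the channel position, so this file was also written via the channel.
        if (hasWriteBuffer || fileChannelPosition != 0) {
            return "fileChannel.force(false)";
        }
        // Neither holds: data was appended through mappedByteBuffer only.
        return "mappedByteBuffer.force()";
    }
}
```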

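So with the off-heap write buffer enabled, data is staged twice, first in writeBuffer and then in fileChannel (the page cache), before it reaches disk.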

 

This completes the walk through message persistence and disk flushing in the Broker.