十億級流量下，我與Redis時延小突刺的戰鬥史

2021 年 6 月 22 日
筆記
Jedis, Redis, 數據庫連接池, 調用鏈

一、背景

某一日收到上游調用方的反饋，提供的某一個Dubbo接口，每天在固定的時間點被短時間熔斷，拋出的異常信息為提供方dubbo線程池被耗盡。當前dubbo接口日請求量18億次，報錯請求94W/天，至此開始了優化之旅。

二、快速應急

2.1 快速定位

首先進行常規的系統信息監控（機器、JVM內存、GC、線程），發現雖稍有突刺，但都在合理範圍內，且跟報錯時間點對不上，先暫時忽略。

其次進行流量分析，發現每天固定時間點會有流量突增的情況，流量突增的點跟報錯的時間點也吻合，初步判斷為短時大流量導致。

流量趨勢

被降級量

接口99線

三、尋找性能瓶頸點

3.1 接口流程分析

3.1.1 流程圖

3.1.2 流程分析

收到請求後調用下游接口，使用hystrix熔斷器，熔斷時間為500MS；

根據下游接口返回的數據，進行詳情數據的封裝，第一步先到本地緩存中獲取，如果本地緩存沒有，則從Redis進行回源，Redis中無則直接返回，異步線程從數據庫進行回源。

如果第一步調用下游接口異常，則進行數據兜底，兜底流程為先到本地緩存中獲取，如果本地緩存沒有，則從Redis進行回源，Redis中無則直接返回，異步線程從數據庫進行回源。

3.2 性能瓶頸點排查

3.2.1 下游接口服務耗時比較長

調用鏈顯示，雖然下游接口的P99線在峰值流量時存在突刺，超出1S，但因為熔斷超時的設置（熔斷時間500MS，coreSize&masSize=50，下游接口平均耗時10MS以下），判斷下游接口不是問題的關鍵點，為進一步排除干擾，在下游服務存在突刺時能快速失敗，調整熔斷時間為100MS，dubbo超時時間100MS。

3.2.2 獲取詳情本地緩存無數據，Redis回源

藉助調用鏈平台，第一步分析Redis請求流量，以此來判斷本地緩存的命中率，發現Redis的流量是接口流量的2倍，從設計上來說不應該出現這個現象。開始代碼Review，發現在有一處邏輯出現了問題。

沒有從本地緩存讀取，而是直接從Redis中獲取了數據，Redis最大響應時間也確實發現了不合理的突刺，繼續分析發現Redis響應時間和Dubbo99線突刺情況基本一致，感覺此時已經找到了問題的原因，心中暗喜。

Redis請求流量

服務接口請求流量

Dubbo99線

Redis最大響應時間

3.2.3 獲取兜底數據本地緩存無數據，Redis回源

正常

3.2.4 記錄請求結果入Redis

因為當前Redis做了資源隔離，且未在DB後台查詢到慢日誌，此時分析導致Redis變慢的原因有很多，不過其他的都被主觀忽略了，注意力都在請求Redis流量翻倍的問題上了，故優先解決3.2.2中的問題。

四、解決方案

4.1 3.3.2中定位的問題上線

上線前Redis請求量

上線後Redis請求量

上線後Redis流量翻倍問題得到解決，Redis最大響應時間突刺有所緩解，但依舊沒能徹底解決，說明大流量查詢不是最根本的原因。

redis最大響應時間（上線前）

redis最大響應時間（上線後）

4.2 Redis擴容

在Redis異常流量問題解決後，問題並未得到徹底解決，此時能做的就是靜下心來，仔細去梳理導致Redis慢的原因，思路主要從以下三個方面:

出現了慢查詢
Redis服務出現性能瓶頸
客戶端配置不合理

基於以上思路，一個個的進行排查；查詢Redis慢查詢日誌，未發現慢查詢。

借用調用鏈平台詳細分析慢的Redis命令，沒有了大流量導致的慢查詢的干擾，問題定位流程很快，大量的耗時請求在setex方法上，偶爾出現查詢的慢請求也都是在setex方法之後，根據Redis單線程的特性判斷setex是Redis99線突刺的元兇。找到具體語句，定位到具體業務後，首先申請擴容Redis，由6個master擴到8個master。

Redis擴容前

Redis擴容後

從結果上看，擴容基本上沒有效果，說明redis服務本身不是性能瓶頸點，此時剩下的一個就是客戶端相關配置了。

4.3 客戶端參數優化

4.3.1 連接池優化

Redis擴容沒有效果，針對客戶端可能出現的問題，此時懷疑的點有兩個方向。

第一個是客戶端在處理Redis集群模式時，對連接的管理上存在BUG，第二個是連接池參數設置不合理，此時源碼分析和連接池參數調整同步進行。

4.3.1.1 判斷客戶端連接管理上是否有BUG

在分析完，客戶端處理連接池的源碼後，沒有問題，跟預想一致，按照槽位緩存連接池，第一個假設被排除，源碼如下。

1、setEx
  public String setex(final byte[] key, final int seconds, final byte[] value) {
    return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
      @Override
      public String execute(Jedis connection) {
        return connection.setex(key, seconds, value);
      }
    }.runBinary(key);
  }
 
2、runBinary
  public T runBinary(byte[] key) {
    if (key == null) {
      throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
    }
 
    return runWithRetries(key, this.maxAttempts, false, false);
  }
3、runWithRetries
  private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    if (attempts <= 0) {
      throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
    }
 
    Jedis connection = null;
    try {
 
      if (asking) {
        // TODO: Pipeline asking with the original command to make it
        // faster....
        connection = askConnection.get();
        connection.asking();
 
        // if asking success, reset asking flag
        asking = false;
      } else {
        if (tryRandomNode) {
          connection = connectionHandler.getConnection();
        } else {
          connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        }
      }
 
      return execute(connection);
 
    }
 
4、getConnectionFromSlot
  public Jedis getConnectionFromSlot(int slot) {
    JedisPool connectionPool = cache.getSlotPool(slot);
    if (connectionPool != null) {
      // It can't guaranteed to get valid connection because of node
      // assignment
      return connectionPool.getResource();
    } else {
      renewSlotCache(); //It's abnormal situation for cluster mode, that we have just nothing for slot, try to rediscover state
      connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        return connectionPool.getResource();
      } else {
        //no choice, fallback to new connection to random node
        return getConnection();
      }
    }
  }

4.3.1.2 分析連接池參數

通過跟中間件團隊溝通，以及參考commons-pool2官方文檔修改如下；

參數調整後，1S以上的請求量得到減少，但還是存在，上游反饋降級量由每天90萬左右降到每天6W個（關於maxWaitMillis設置為200MS後為什麼還會有超過200MS的請求，下文有解釋）。

參數優化後Reds最大響應時間

參數優化後接口報錯量

4.3.2 持續優化

優化不能停止，如何把Redis的所有寫入請求降低到200MS以內，此時的優化思路還是調整客戶端配置參數，分析Jedis獲取連接相關源碼；

Jedis獲取連接源碼

final AbandonedConfig ac = this.abandonedConfig;
if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
        (getNumIdle() < 2) &&
        (getNumActive() > getMaxTotal() - 3) ) {
    removeAbandoned(ac);
}

PooledObject<T> p = null;

// Get local copy of current config so it is consistent for entire
// method execution
final boolean blockWhenExhausted = getBlockWhenExhausted();

boolean create;
final long waitTime = System.currentTimeMillis();

while (p == null) {
    create = false;
    p = idleObjects.pollFirst();
    if (p == null) {
        p = create();
        if (p != null) {
            create = true;
        }
    }
    if (blockWhenExhausted) {
        if (p == null) {
            if (borrowMaxWaitMillis < 0) {
                p = idleObjects.takeFirst();
            } else {
                p = idleObjects.pollFirst(borrowMaxWaitMillis,
                        TimeUnit.MILLISECONDS);
            }
        }
        if (p == null) {
            throw new NoSuchElementException(
                    "Timeout waiting for idle object");
        }
    } else {
        if (p == null) {
            throw new NoSuchElementException("Pool exhausted");
        }
    }
    if (!p.allocate()) {
        p = null;
    }

    if (p != null) {
        try {
            factory.activateObject(p);
        } catch (final Exception e) {
            try {
                destroy(p);
            } catch (final Exception e1) {
                // Ignore - activation failure is more important
            }
            p = null;
            if (create) {
                final NoSuchElementException nsee = new NoSuchElementException(
                        "Unable to activate object");
                nsee.initCause(e);
                throw nsee;
            }
        }
        if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
            boolean validate = false;
            Throwable validationThrowable = null;
            try {
                validate = factory.validateObject(p);
            } catch (final Throwable t) {
                PoolUtils.checkRethrow(t);
                validationThrowable = t;
            }
            if (!validate) {
                try {
                    destroy(p);
                    destroyedByBorrowValidationCount.incrementAndGet();
                } catch (final Exception e) {
                    // Ignore - validation failure is more important
                }
                p = null;
                if (create) {
                    final NoSuchElementException nsee = new NoSuchElementException(
                            "Unable to validate object");
                    nsee.initCause(validationThrowable);
                    throw nsee;
                }
            }
        }
    }
}

updateStatsBorrow(p, System.currentTimeMillis() - waitTime);

return p.getObject();

獲取連接的大致流程如下:

是否有空閑連接，有空閑連接就直接返回，沒有就創建；

創建時如果超出最大連接數，則判斷是否有其他線程在創建連接，如果沒則直接返回，如果有則等待maxWaitMis時間（其他線程可能創建失敗），如果未超出最大連接，則執行創建連接操作（此時獲取連接等待時間可能會大於maxWaitMs）。

如果創建不成功，則判斷是否是阻塞獲取連接，如果不是則直接拋出異常，連接池不夠用，如果是則判斷maxWaitMillis是否小於0，如果小於0則阻塞等待，如果大於0則阻塞等待maxWaitMillis。

後續就是根據參數來判斷是否需要做連接check等。

根據以上流程分析，maxWaitMills目前設置的為200，以上流程加起來最大阻塞時間為400MS，大部分情況為200MS，不應該出現超出400MS的突刺。

此時問題可能出現在創建連接上，因為創建連接比較耗時，且創建時間不定，重點分析是否有這個場景，通過DB後台監控Redis連接情況。

DB後台監控Redis服務連接

分析上圖發現，確實在幾個時間點（9:00,12:00，19:00…），redis連接數存在上漲情況，跟Redis突刺時間基本吻合。感覺（之前的各種嘗試後，已經不敢用確定了）問題到此定位清晰（在突增流量過來時，連接池可用連接滿足不了需求，會創建連接，造成請求等待）。

此時的想法是在服務啟動時就進行連接池的創建，盡量減少新連接的創建，修改連接池參數vivo.cache.depend.common.poolConfig.minIdle，結果竟然無效？？？

啥都不說了，開始擼源碼，jedis底層使用的是commons-poll2來管理連接的，查看項目中使用的commons-pool2-2.6.2.jar部分源碼；

CommonPool2源碼

public GenericObjectPool(final PooledObjectFactory<T> factory,
        final GenericObjectPoolConfig<T> config) {
 
    super(config, ONAME_BASE, config.getJmxNamePrefix());
 
    if (factory == null) {
        jmxUnregister(); // tidy up
        throw new IllegalArgumentException("factory may not be null");
    }
    this.factory = factory;
 
    idleObjects = new LinkedBlockingDeque<>(config.getFairness());
 
    setConfig(config);
}

竟然發現沒有初始化連接的地方，開始諮詢中間件團隊，中間件團隊給出的源碼（commons-pool2-2.4.2.jar）如下，方法執行後多了一次startEvictor方法的調用？

1、初始化連接池
public GenericObjectPool(PooledObjectFactory<T> factory,
            GenericObjectPoolConfig config) {
super(config, ONAME_BASE, config.getJmxNamePrefix());
if (factory == null) {
            jmxUnregister(); // tidy up
throw new IllegalArgumentException("factory may not be null");
        }
this.factory = factory;
        idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
        setConfig(config);
        startEvictor(getTimeBetweenEvictionRunsMillis());
    }

為啥不一樣？？？開始檢查Jar包，版本不一樣，中間件給出的版本是在V2.4.2，項目實際使用的是V2.6.2，分析startEvictor有一步邏輯正是處理連接池預熱邏輯。

Jedis連接池預熱

1、final void startEvictor(long delay) {
        synchronized (evictionLock) {
            if (null != evictor) {
                EvictionTimer.cancel(evictor);
                evictor = null;
                evictionIterator = null;
            }
            if (delay > 0) {
                evictor = new Evictor();
                EvictionTimer.schedule(evictor, delay, delay);
            }
        }
    }
2、class Evictor extends TimerTask {
       /**
         * Run pool maintenance.  Evict objects qualifying for eviction and then
         * ensure that the minimum number of idle instances are available.
         * Since the Timer that invokes Evictors is shared for all Pools but
         * pools may exist in different class loaders, the Evictor ensures that
         * any actions taken are under the class loader of the factory
         * associated with the pool.
         */
        @Override
        public void run() {
            ClassLoader savedClassLoader =
                    Thread.currentThread().getContextClassLoader();
            try {
                if (factoryClassLoader != null) {
                    // Set the class loader for the factory
                    ClassLoader cl = factoryClassLoader.get();
                    if (cl == null) {
                        // The pool has been dereferenced and the class loader
                        // GC'd. Cancel this timer so the pool can be GC'd as
                        // well.
                        cancel();
                        return;
                    }
                    Thread.currentThread().setContextClassLoader(cl);
                }
 
                // Evict from the pool
                try {
                    evict();
                } catch(Exception e) {
                    swallowException(e);
                } catch(OutOfMemoryError oome) {
                    // Log problem but give evictor thread a chance to continue
                    // in case error is recoverable
                    oome.printStackTrace(System.err);
                }
                // Re-create idle instances.
                try {
                    ensureMinIdle();
                } catch (Exception e) {
                    swallowException(e);
                }
            } finally {
                // Restore the previous CCL
                Thread.currentThread().setContextClassLoader(savedClassLoader);
            }
        }
    }
3、 void ensureMinIdle() throws Exception {
        ensureIdle(getMinIdle(), true);
    }
4、 private void ensureIdle(int idleCount, boolean always) throws Exception {
        if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
            return;
        }
 
        while (idleObjects.size() < idleCount) {
            PooledObject<T> p = create();
            if (p == null) {
                // Can't create objects, no reason to think another call to
                // create will work. Give up.
                break;
            }
            if (getLifo()) {
                idleObjects.addFirst(p);
            } else {
                idleObjects.addLast(p);
            }
        }
        if (isClosed()) {
            // Pool closed while object was being added to idle objects.
            // Make sure the returned object is destroyed rather than left
            // in the idle object pool (which would effectively be a leak)
            clear();
        }
    }

修改Jar版本，配置中心增加vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis（檢查一次連接池中空閑的連接，把空閑時間超過minEvictableIdleTimeMillis毫秒的連接斷開,直到連接池中的連接數到minIdle為止）。

vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis（連接池中連接可空閑的時間,毫秒）兩個參數，重啟服務後，連接池正常預熱，最終從Redis層面上解決問題。

優化結果如下，性能問題基本得到解決；

Redis響應時間（優化前）

Redis響應時間（優化後）

接口99線（優化前）

接口99線（優化後）

五、總結

出現線上問題時，首先要考慮的還是快速恢複線上業務，將業務的影響度降到最低，所以針對線上的業務，要提前做好限流、熔斷、降級等策略，在線上出現問題時能快速找到恢復方案。對公司各監控平台的熟練使用程度，決定了定位問題的速度，每個開發都要把熟練使用監控平台（機器、服務、接口、DB等）作為一個基本能力。

Redis出現響應慢時，可以優先從Redis集群服務端（機器負載、服務是否有慢查詢）、業務代碼（是否有BUG）、客戶端（連接池配置是否合理）三個方面去排查，基本上能排查出大部分Redis慢響應問題。

Redis連接池在系統冷啟動時，對連接池的預熱，不同commons-pool2的版本，冷啟動的策略也不同，但都需要配置minEvictableIdleTimeMillis參數才會生效，可以看下common-pool2官方文檔，對常用參數都做到心中有數，在問題出現時能快速定位。

連接池默認參數在解決大流量的業務上稍顯乏力，需要針對大流量場景進行調優處理，如果業務上流量不是很大直接使用默認參數即可。

具體問題要具體分析，不能解決問題的時候要變通思路，通過各種方法去嘗試解決問題。

作者：vivo互聯網服務器團隊-Wang Shaodong

Tags: Jedis Redis 數據庫連接池調用鏈

十億級流量下，我與Redis時延小突刺的戰鬥史

一、背景

二、快速應急

2.1 快速定位

三、尋找性能瓶頸點

3.1 接口流程分析

3.1.1 流程圖

3.1.2 流程分析

3.2 性能瓶頸點排查

3.2.1 下游接口服務耗時比較長

3.2.2 獲取詳情本地緩存無數據，Redis回源

3.2.3 獲取兜底數據本地緩存無數據，Redis回源

3.2.4 記錄請求結果入Redis

四、解決方案

4.1 3.3.2中定位的問題上線

4.2 Redis擴容

4.3 客戶端參數優化

4.3.1 連接池優化

4.3.1.1 判斷客戶端連接管理上是否有BUG

4.3.2 持續優化

五、總結

VirMach 便宜 VPS

QNews

十億級流量下，我與Redis時延小突刺的戰鬥史

一、背景

二、快速應急

2.1 快速定位

三、尋找性能瓶頸點

3.1 接口流程分析

3.1.1 流程圖

3.1.2 流程分析

3.2 性能瓶頸點排查

3.2.1 下游接口服務耗時比較長

3.2.2 獲取詳情本地緩存無數據，Redis回源

3.2.3 獲取兜底數據本地緩存無數據，Redis回源

3.2.4 記錄請求結果入Redis

四、解決方案

4.1 3.3.2中定位的問題上線

4.2 Redis擴容

4.3 客戶端參數優化

4.3.1 連接池優化

4.3.1.1 判斷客戶端連接管理上是否有BUG

4.3.2 持續優化

五、總結

分享此文：

Related Posts

『居善地』接口測試 — 9、接口自動化框架的發送郵件實現

fasthttp：比net/http快十倍的Go框架(server 篇)

【TCP/IP】TCP詳解筆記

『言善信』Fiddler工具 — 16、使用Fiddler抓取移動端App請求

VirMach 便宜 VPS

QNews

熱門搜尋