A Detailed Analysis of a Redis Cluster Failure

Failure symptoms:


The application layer reported errors indicating that queries against Redis were failing.

Cluster composition:

3 masters and 3 slaves; each node holds about 8 GB of data.

Machine distribution:

All three machines are in the same rack:

xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201

redis-server process status:

Running ps -eo pid,lstart | grep $pid

showed that the process had been up continuously for 3 months.

Cluster node state before the failure:

xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

--------------------------------- Log analysis ---------------------------------

Step 1:
Master 8371 lost its connection to slave 8373:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

Step 2:
Masters 8370/8375 determined that 8371 was unreachable:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).
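The "quorum reached" in this log line refers to Redis Cluster's failure-detection rule: each master independently flags an unresponsive peer as PFAIL once it has been silent longer than cluster-node-timeout, and the flag is promoted to FAIL and broadcast once a majority of masters agree. A simplified sketch of that majority check (our illustration, not Redis source code):

```python
# Simplified model of the PFAIL -> FAIL promotion in Redis Cluster:
# a node is marked FAIL once a majority of masters have reported it PFAIL.
def fail_quorum_reached(pfail_reports, total_masters):
    # A strict majority of masters must consider the node unreachable.
    needed = total_masters // 2 + 1
    return len(pfail_reports) >= needed

# In this 3-master cluster, reports from 8370 and 8375 are enough:
print(fail_quorum_reached({"8370", "8375"}, total_masters=3))  # True
```

This is why only two of the three masters appear in the logs: with 3 masters, 2 matching PFAIL reports already form a quorum.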

Step 3:
Slaves 8372/8373/8374 received a FAIL message from master 8375 about 8371:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

Step 4:
Masters 8370/8375 granted failover authorization to 8373, promoting it to master:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

Step 5:
The former master 8371 updated its own configuration to become a slave of 8373:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

Step 6:
Masters 8370/8375/8373 cleared the FAIL state for 8371 (now a master without slots, reachable again):
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

Step 7:
The new slave 8371 started its first full resync from the new master 8373:
8373's log:
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371's log:
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

Step 8:
Masters 8370/8375 marked 8373 (the new master) as failing:
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

Step 9:
Masters 8370/8375 determined that 8373 (the new master) was reachable again:
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

Step 10:
Master 8373 completed the BGSAVE needed for the full resync:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

Step 11:
Slave 8371 started receiving data from master 8373:
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

Step 12:
Master 8373 scheduled slave 8371's connection for closing because it exceeded the client output buffer limit (omem=95944066, roughly 91 MB, is above the 64 MB soft limit, and idle=148 shows it had stayed there well past the 60-second window):
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.
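The long "# Client id=..." warning uses the same space-separated key=value format as CLIENT LIST output. A small helper of our own (not a Redis tool) makes the interesting fields easy to read off:

```python
# Parse a Redis CLIENT LIST style "key=value key=value ..." line.
def parse_client_line(line):
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

log = ("id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 "
       "flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 "
       "oll=4103 omem=95944066 events=rw cmd=psync")
client = parse_client_line(log)
omem_mb = int(client["omem"]) / 1024 / 1024
# flags=S marks a replica connection; cmd=psync is the replication stream.
print(f"{client['cmd']} client buffering {omem_mb:.0f} MB")
```

Here omem (about 91 MB) counts only the reply buffer held in memory for this replica; obl and oll are the static buffer length and the reply list length feeding it.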

Step 13:
Slave 8371's sync with master 8373 failed when the connection dropped; the first full resync had failed:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

Step 14:
Slave 8371 tried to resync again but could not connect: master 8373 had run out of client connections:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'
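The "-ERR max number of clients reached" reply is governed by the maxclients directive in redis.conf, so a reconnect storm like the one here can exhaust the connection budget and lock out even the replica. For reference (the default value is shown; the cluster's actual setting is not in the logs):

```
# redis.conf: the server refuses new connections past this count
maxclients 10000
```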

Step 15:
Slave 8371 reconnected to master 8373 and began a second full resync:
8371's log:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373's log:
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

Step 16:
Slave 8371 received the data successfully and began loading it into memory:
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

Step 17:
The cluster returned to normal:
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

Step 18:
Slave 8371 finished syncing successfully; the resync took about 7 minutes:
46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success

Why 8371 was judged to be down:

Since the machines sit in the same rack, a network outage was unlikely, so we checked the slow query log with SLOWLOG GET and found that a KEYS command had been executed, taking 8.3 seconds. The cluster node timeout, by comparison, was only 5 seconds (cluster-node-timeout 5000).
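The mechanism (illustrative arithmetic of our own, not Redis source code): Redis serves all commands on a single thread, so while KEYS runs, the node cannot answer cluster PING messages, and after cluster-node-timeout its peers flag it PFAIL.

```python
# How long a node stays unresponsive beyond cluster-node-timeout while
# a single slow command blocks the event loop.
def silent_for(command_duration_s, node_timeout_s):
    return max(0.0, command_duration_s - node_timeout_s)

# 8.3 s KEYS (from SLOWLOG GET) vs. cluster-node-timeout 5000 ms:
print(f"{silent_for(8.3, 5.0):.1f}")  # 3.3 seconds past the timeout
```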

The cause of the apparent node failure:

A client executed a single command that took 8.3 s:

2016/9/9 18:57:43 the KEYS command started
2016/9/9 18:57:50 8371 was judged unreachable (per the Redis logs)
2016/9/9 18:57:51 the KEYS command finished

To summarize, several problems combined:

1. Because cluster-node-timeout was set fairly short, the slow KEYS query caused the cluster to judge node 8371 as down.

2. With 8371 down, 8373 was promoted to master and a full master-slave sync began.

3. The client-output-buffer-limit setting caused the first full sync to fail.

4. A bug in the PHP client's connection pool then caused it to reconnect to the server aggressively, producing an effect similar to a SYN flood.

5. After the first full sync failed, it took the slave 30 seconds to reconnect to the master (the maximum of 10,000 client connections had been exceeded).
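One standard mitigation for problem 1 (our suggestion, not something done in the original incident response) is to disable the blocking KEYS command in redis.conf so that clients are forced onto the cursor-based, non-blocking SCAN:

```
# redis.conf: renaming a command to the empty string disables it entirely
rename-command KEYS ""
```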

About the client-output-buffer-limit parameter:

# The syntax of every client-output-buffer-limit directive is the following: 
# 
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds> 
# 
# A client is immediately disconnected once the hard limit is reached, or if 
# the soft limit is reached and remains reached for the specified number of 
# seconds (continuously). 
# So for instance if the hard limit is 32 megabytes and the soft limit is 
# 16 megabytes / 10 seconds, the client will get disconnected immediately 
# if the size of the output buffers reach 32 megabytes, but will also get 
# disconnected if the client reaches 16 megabytes and continuously overcomes 
# the limit for 10 seconds. 
# 
# By default normal clients are not limited because they don't receive data 
# without asking (in a push way), but just after a request, so only 
# asynchronous clients may create a scenario where data is requested faster 
# than it can read. 
# 
# Instead there is a default limit for pubsub and slave clients, since 
# subscribers and slaves receive data in a push fashion. 
# 
# Both the hard or the soft limit can be disabled by setting them to zero. 
client-output-buffer-limit normal 0 0 0 
client-output-buffer-limit slave 256mb 64mb 60 
client-output-buffer-limit pubsub 32mb 8mb 60 
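As a sketch of how the three numbers interact (our own illustration of the documented behavior, not Redis source code): a client is dropped immediately when its buffer reaches the hard limit, or when it stays above the soft limit for soft-seconds continuously, which is exactly what happened to the replica in step 12.

```python
# Model of the client-output-buffer-limit disconnect decision.
class OutputBufferPolicy:
    def __init__(self, hard_limit, soft_limit, soft_seconds):
        self.hard_limit = hard_limit      # bytes; 0 disables the check
        self.soft_limit = soft_limit      # bytes; 0 disables the check
        self.soft_seconds = soft_seconds  # how long the soft breach may last
        self.soft_since = None            # when the soft limit was first hit

    def should_disconnect(self, buffer_bytes, now):
        if self.hard_limit and buffer_bytes >= self.hard_limit:
            return True  # hard limit: immediate disconnect
        if self.soft_limit and buffer_bytes >= self.soft_limit:
            if self.soft_since is None:
                self.soft_since = now
            elif now - self.soft_since >= self.soft_seconds:
                return True  # soft limit held continuously for too long
        else:
            self.soft_since = None  # dipped back under: reset the clock
        return False

# Default slave policy: 256 MB hard, 64 MB soft for 60 s.
MB = 1024 * 1024
policy = OutputBufferPolicy(256 * MB, 64 * MB, 60)
print(policy.should_disconnect(300 * MB, now=0))  # True (hard limit exceeded)
```

In the incident the replica sat around 91 MB, under the 256 MB hard limit but over the 64 MB soft limit for well past 60 seconds, so the soft branch fired.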

