How to Resolve a Kafka Bug Triggered by a Network Failure

Many people without much hands-on experience are at a loss when a network failure triggers a bug in Kafka itself. This article summarizes the cause of the problem and how to resolve it.


2019-09-03 17:06:25

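The data-center network was unstable for about one minute. A switch problem caused the nodes of the Kafka cluster to intermittently lose contact with one another.

The Kafka log showed the following: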

[2019-09-03 17:06:25,610] WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor)
[2019-09-03 17:06:31,906] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO [Partition opStaffCancelPost-18 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,382] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:38,507] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn)

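The instability lasted about one minute, after which the network recovered.

A highly available Kafka cluster should have recovered automatically at this point, but it did not.

Next, the kafka-manager monitoring became abnormal: a large number of topics showed no brokers at all, and message offsets stopped changing.

Connecting manually with a Kafka client also looked strange: no incoming messages could be seen for consumption, yet the applications were producing and consuming normally.

The Kafka log showed the following: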

[2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition)
[2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
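
When many partitions are affected, it can help to summarize which ones keep hitting the zkVersion mismatch. The following is a minimal sketch, not part of any Kafka tooling: the function name `find_stuck_partitions` is hypothetical, and the regular expression assumes the exact message format shown in the log lines above.

```python
import re
from collections import Counter

# Matches the broker message shown in the log excerpt above.
# Assumption: the message format is exactly as it appears in that excerpt.
ZK_VERSION_MISMATCH = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper, "
    r"skip updating ISR"
)


def find_stuck_partitions(lines):
    """Count zkVersion-mismatch messages per (broker id, partition)."""
    counts = Counter()
    for line in lines:
        match = ZK_VERSION_MISMATCH.search(line)
        if match:
            counts[(int(match["broker"]), match["partition"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] "
        "Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR "
        "(kafka.cluster.Partition)",
        "[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] "
        "Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)",
    ]
    for (broker, partition), hits in sorted(find_stuck_partitions(sample).items()):
        print(f"broker {broker}: {partition} skipped ISR update {hits} time(s)")
```

A partition that shows up repeatedly over a long period, well after the network has recovered, is a hint that the broker is stuck rather than experiencing a transient conflict.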

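The next day the monitoring still looked the same, which was clearly wrong: the applications were producing to and consuming from Kafka normally, yet every monitoring metric reported the cluster as abnormal. The cluster was still in a faulty state. Shortly after 11 a.m. we began restarting nodes manually, and at that point a second failure occurred: connections to the entire Kafka cluster failed and no messages could be produced. The Kafka log showed the following: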

[2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,865] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

Cause:

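During the network problem, the Kafka controller's ZooKeeper session expired and it lost controllership, but for a short time this zombie controller kept updating ZooKeeper and sending LeaderAndIsrRequests to the brokers. When this happens, the other brokers have not yet updated their leader and ISR information, so their later attempts to update the ISR in ZooKeeper fail. This is what the "Cached zkVersion [...] not equal to that in zookeeper, skip updating ISR" messages in the log report.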
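
The failure mode can be illustrated with a toy model of a versioned, conditional write. This is a simplified sketch and not Kafka's or ZooKeeper's actual code: `VersionedNode`, `PartitionLeader`, and `simulate` are hypothetical names, and the model assumes the leader only learns a new version when its own write succeeds.

```python
class VersionedNode:
    """Toy stand-in for a ZooKeeper znode that carries a data version."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        """Write only if the caller's expected version matches the stored one."""
        if expected_version != self.version:
            return False
        self.data = data
        self.version += 1
        return True


class PartitionLeader:
    """Toy partition leader that caches the node version it last saw."""

    def __init__(self, node):
        self.cached_version = node.version

    def try_update_isr(self, node, new_state):
        """Attempt an ISR update using the cached version."""
        if node.conditional_set(new_state, self.cached_version):
            self.cached_version = node.version
            return True
        # Corresponds to "Cached zkVersion not equal ..., skip updating ISR".
        return False


def simulate():
    node = VersionedNode({"leader": 180, "isr": [180, 182, 183]})
    leader = PartitionLeader(node)

    # Another writer (the zombie controller in the article) changes the node,
    # and the leader is never told about the new version.
    node.conditional_set({"leader": 180, "isr": [180, 182, 183]}, node.version)

    # Every later attempt by the leader fails, because nothing refreshes its cache.
    first = leader.try_update_isr(node, {"leader": 180, "isr": [180]})
    second = leader.try_update_isr(node, {"leader": 180, "isr": [180]})
    return first, second, node.version, leader.cached_version


if __name__ == "__main__":
    print(simulate())
```

In this model the leader stays stuck until something outside it refreshes `cached_version`, which matches the observation above that the cluster did not recover on its own after the network came back.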

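The Kafka project has confirmed this bug and fixed it in KAFKA-5642 by handling ZooKeeper session expiration events properly.

We were running Kafka 1.0.0, and the official fix is in 1.1.0. Anyone still running a Kafka version below 1.1.0 should be aware of this problem: either raise the ZooKeeper connection and session timeouts by a few seconds so that a brief network blip does not expire the session, or upgrade Kafka.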

zookeeper.connection.timeout.ms=10000
zookeeper.session.timeout.ms=10000
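
To check whether a broker configuration already sets these two values high enough, a small script along the following lines may help. It is only a sketch: `parse_properties` and `zk_timeouts_below` are hypothetical helpers, the parser handles only simple `key=value` lines, and a missing key is reported as "not explicitly set" because the built-in default depends on the Kafka version.

```python
ZK_TIMEOUT_KEYS = (
    "zookeeper.connection.timeout.ms",
    "zookeeper.session.timeout.ms",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def zk_timeouts_below(props, minimum_ms):
    """Return the timeout keys that are unset or set below minimum_ms."""
    return [
        key
        for key in ZK_TIMEOUT_KEYS
        if key not in props or int(props[key]) < minimum_ms
    ]


if __name__ == "__main__":
    # Example input; in practice this would be the broker's properties file.
    example = """
    # ZooKeeper settings
    zookeeper.connection.timeout.ms=10000
    zookeeper.session.timeout.ms=6000
    """
    for key in zk_timeouts_below(parse_properties(example), 10000):
        print(f"{key} is unset or below 10000 ms")
```

Raising the timeouts only reduces the chance of a session expiring during a short network blip; it does not remove the bug, so upgrading remains the more reliable fix.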

