本期內(nèi)容:
創(chuàng)新互聯(lián)建站是一家專注于網(wǎng)站設(shè)計制作、網(wǎng)站設(shè)計與策劃設(shè)計,呼中網(wǎng)站建設(shè)哪家好?創(chuàng)新互聯(lián)建站做網(wǎng)站,專注于網(wǎng)站建設(shè)十多年,網(wǎng)設(shè)計領(lǐng)域的專業(yè)建站公司;建站業(yè)務涵蓋:呼中等地區(qū)。呼中做網(wǎng)站價格咨詢:18980820575
1、Spark Streaming元數(shù)據(jù)清理詳解
2、Spark Streaming元數(shù)據(jù)清理源碼解析
一、如何研究Spark Streaming元數(shù)據(jù)清理
操作DStream的時候會產(chǎn)生元數(shù)據(jù),所以要解決RDD的數(shù)據(jù)清理工作就一定要從DStream入手。因為DStream是RDD的模板,DStream之間有依賴關(guān)系。
DStream的操作產(chǎn)生了RDD,接收數(shù)據(jù)也靠DStream,數(shù)據(jù)的輸入,數(shù)據(jù)的計算,輸出整個生命周期都是由DStream構(gòu)建的。由此,DStream負責RDD的整個生命周期。因此研究的入口的是DStream。
基于Kafka數(shù)據(jù)來源,通過Direct的方式訪問Kafka,DStream隨著時間的進行,會不斷的在自己的內(nèi)存數(shù)據(jù)結(jié)構(gòu)中維護一個HashMap,HashMap維護的就是時間窗口,以及時間窗口下的RDD.按照Batch Duration來存儲RDD以及刪除RDD.
Spark Streaming本身是一直在運行的,在自己計算的時候會不斷的產(chǎn)生RDD,例如每秒Batch Duration都會產(chǎn)生RDD,除此之外可能還有累加器,廣播變量。由于不斷的產(chǎn)生這些對象,因此Spark Streaming有自己的一套對象,元數(shù)據(jù)以及數(shù)據(jù)的清理機制。
Spark Streaming對RDD的管理就相當于JVM的GC
二、源碼解析
Spark Streaming是通過我們設(shè)定的Batch Durations來不斷的產(chǎn)生RDD,Spark Streaming清理元數(shù)據(jù)跟時鐘有關(guān),因為數(shù)據(jù)是周期性的產(chǎn)生,所以肯定是周期性的釋放,這些都跟JobGenerator有關(guān),所以我們先從這開始研究。
1、RecurringTimer: 消息循環(huán)器將消息不斷的發(fā)送給EventLoop
= RecurringTimer(...millisecondslongTime => .post((Time(longTime))))
2、eventLoop:onReceive接收到消息
(): = synchronized { (!= ) = EventLoop[JobGeneratorEvent]() { (event: JobGeneratorEvent): = processEvent(event) (e: ): = { jobScheduler.reportError(e) } } .start() (.) { restart() } { startFirstTime() } }
3、在processEvent中接收清理元數(shù)據(jù)消息
/** Processes all events */ private def processEvent(event: JobGeneratorEvent) { logDebug("Got event " + event) event match { case GenerateJobs(time) => generateJobs(time) case ClearMetadata(time) => clearMetadata(time) //清理元數(shù)據(jù) case DoCheckpoint(time, clearCheckpointDataLater) => doCheckpoint(time, clearCheckpointDataLater) case ClearCheckpointData(time) => clearCheckpointData(time) //清理checkpoint } }
具體的方法實現(xiàn)內(nèi)容就不再這里說,我們進一步分析下這些清理動作是在什么時候被調(diào)用的,在Spark Streaming應用程序中,最終Job是交給JobHandler來執(zhí)行的,所以我們分析下JobHandler
private class JobHandler(job: Job) extends Runnable with Logging { import JobScheduler._ def run() { try { val formattedTime = UIUtils.formatBatchTime( job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false) val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}" val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]" ssc.sc.setJobDescription( s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""") ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString) ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString) // We need to assign `eventLoop` to a temp variable. Otherwise, because // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then // it's possible that when `post` is called, `eventLoop` happens to null. var _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobStarted(job, clock.getTimeMillis())) // Disable checks for existing output directories in jobs launched by the streaming // scheduler, since we may need to write output to an existing directory during checkpoint // recovery; see SPARK-4835 for more details. PairRDDFunctions.disableOutputSpecValidation.withValue(true) { job.run() } _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobCompleted(job, clock.getTimeMillis())) } } else { // JobScheduler has been stopped. } } finally { ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null) ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null) } } } }
當Job完成的時候,會發(fā)JobCompleted消息給onReceive,通過processEvent來執(zhí)行具體的方法
private def processEvent(event: JobSchedulerEvent) { try { event match { case JobStarted(job, startTime) => handleJobStart(job, startTime) case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime) case ErrorReported(m, e) => handleError(m, e) } } catch { case e: Throwable => reportError("Error in job scheduler", e) } }
private def handleJobCompletion(job: Job, completedTime: Long) { val jobSet = jobSets.get(job.time) jobSet.handleJobCompletion(job) job.setEndTime(completedTime) listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo)) logInfo("Finished job " + job.id + " from job set of time " + jobSet.time) if (jobSet.hasCompleted) { jobSets.remove(jobSet.time) jobGenerator.onBatchCompletion(jobSet.time) logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format( jobSet.totalDelay / 1000.0, jobSet.time.toString, jobSet.processingDelay / 1000.0 )) listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo)) } job.result match { case Failure(e) => reportError("Error running job " + job, e) case _ => } }
通過jobGenerator.onBatchCompletion來清理元數(shù)據(jù)
/** * Callback called when a batch has been completely processed. */ def onBatchCompletion(time: Time) { eventLoop.post(ClearMetadata(time)) }
到這里Spark Streaming清理元數(shù)據(jù)的步驟基本上完成了
網(wǎng)站名稱:(版本定制)第16課:SparkStreaming源碼解讀之數(shù)據(jù)清理內(nèi)幕徹底解密
網(wǎng)站鏈接:http://muchs.cn/article18/ighjdp.html
成都網(wǎng)站建設(shè)公司_創(chuàng)新互聯(lián),為您提供標簽優(yōu)化、品牌網(wǎng)站建設(shè)、品牌網(wǎng)站制作、面包屑導航、商城網(wǎng)站、外貿(mào)網(wǎng)站建設(shè)
聲明:本網(wǎng)站發(fā)布的內(nèi)容(圖片、視頻和文字)以用戶投稿、用戶轉(zhuǎn)載內(nèi)容為主,如果涉及侵權(quán)請盡快告知,我們將會在第一時間刪除。文章觀點不代表本網(wǎng)站立場,如需處理請聯(lián)系客服。電話:028-86922220;郵箱:631063699@qq.com。內(nèi)容未經(jīng)允許不得轉(zhuǎn)載,或轉(zhuǎn)載時需注明來源: 創(chuàng)新互聯(lián)