A First Look at the Scrapy Architecture

Scrapy Data Flow

The data flow in Scrapy is controlled by the execution engine. The English text below is quoted from the official Scrapy site; after each step I add my own commentary, based on guesswork, to point the way for further development of the GooSeeker open-source crawler:
The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.

Who prepares the URLs? Apparently the Spider does. So my guess is that the Scrapy framework itself (excluding the Spider) mainly does event scheduling and does not handle URL storage. This looks similar to the crawler compass in the GooSeeker member center, which prepares a batch of URLs for a target site and holds them until the crawl is scheduled. The next goal for this open-source project is therefore to manage URLs in one centralized scheduling store.
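
To make this guess concrete, here is a toy sketch of step 1 using only the Python standard library. It is not Scrapy's own code: `Request`, `ToySpider`, and `collect_seed_requests` are hypothetical names invented for illustration, and the example.com URLs are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Toy stand-in for a crawl request: only the URL is modelled."""
    url: str


class ToySpider:
    """Hypothetical spider that owns its seed URLs."""

    start_urls = ["http://example.com/a", "http://example.com/b"]

    def start_requests(self):
        # The spider, not the framework, decides where the crawl begins.
        for url in self.start_urls:
            yield Request(url)


def collect_seed_requests(spider):
    """Engine side of step 1: pull the first Requests out of the spider."""
    return list(spider.start_requests())


seed_requests = collect_seed_requests(ToySpider())
```

The framework side never stores the URLs itself here. It only asks the spider for them, which matches the division of labor guessed above.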
The Engine asks the Scheduler for the next URLs to crawl.

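This step is hard to understand on its own; it takes some other documentation to grasp it fully. Continuing from the first step: once the Engine has the URLs from the Spider, it wraps each one in a Request and hands it to the event loop, where the Scheduler collects it for scheduling. For now, think of this as queuing Requests. The Engine then asks the Scheduler for the next page addresses to download.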
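
The "queue of Requests" reading can be sketched as follows. This is a minimal model, assuming a first-in, first-out queue with duplicate filtering. `ToyScheduler` and its methods are hypothetical names, and Scrapy's real scheduler may well behave differently.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical scheduler: a FIFO queue that ignores URLs it has already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        # Duplicate filtering is an assumption of this sketch, not something the text states.
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        # None tells the engine there is nothing left to crawl.
        return self._queue.popleft() if self._queue else None


scheduler = ToyScheduler()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    scheduler.enqueue(url)
```

The engine would then call `next_request()` repeatedly and stop when it gets `None` back.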
The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).

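The Engine requests a task from the Scheduler and hands it to the Downloader. Between the Downloader and the Engine sits the Downloader Middleware, a must-have feature for a development framework: developers can plug their own customizations in here.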
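
One way to picture the request direction of such a middleware layer is a chain of small functions that each receive the request and return it, possibly modified. The sketch below is an assumption about the general pattern, not Scrapy's actual middleware interface; `add_user_agent`, `force_https`, and `send_to_downloader` are made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Toy request with a URL and a header dictionary."""
    url: str
    headers: dict = field(default_factory=dict)


def add_user_agent(request):
    # Hypothetical middleware: stamp every outgoing request with a User-Agent.
    request.headers.setdefault("User-Agent", "toy-crawler/0.1")
    return request


def force_https(request):
    # Hypothetical middleware: rewrite plain-HTTP URLs before download.
    if request.url.startswith("http://"):
        request.url = "https://" + request.url[len("http://"):]
    return request


def send_to_downloader(request, middlewares):
    """Pass the request through each middleware in order (request direction)."""
    for middleware in middlewares:
        request = middleware(request)
    return request


prepared = send_to_downloader(Request("http://example.com/"), [add_user_agent, force_https])
```

Because each middleware only sees a request and returns a request, new behavior can be added or removed without touching the engine or the downloader.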
Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).

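When the download completes, a Response is generated and passed to the Engine through the Downloader Middleware. Note that Response, like Request earlier, is capitalized. I have not read the other Scrapy documents yet, but my guess is that these are event objects internal to the Scrapy framework. One can also infer that the engine is asynchronous and event-driven, like the three-level event loop of the DS scraper (DS打数机). For a high-performance, low-overhead engine, this is a necessity.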
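
Scrapy itself is built on the Twisted networking library. The sketch below swaps in Python's standard `asyncio` module purely to illustrate what "asynchronous and event-driven" means here: downloads are started together and each Response is handled as soon as it is ready, not in the order the requests were issued. `fake_download` and `crawl` are hypothetical, and the delays simulate network latency.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Response:
    """Toy response carrying the URL and the downloaded body."""
    url: str
    body: str


async def fake_download(url, delay):
    # Stand-in for network I/O: sleeping yields control to the event loop.
    await asyncio.sleep(delay)
    return Response(url=url, body=f"<html>{url}</html>")


async def crawl(jobs):
    """Start all downloads at once and collect Responses as each one finishes."""
    tasks = [asyncio.create_task(fake_download(url, delay)) for url, delay in jobs]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


responses = asyncio.run(crawl([
    ("http://example.com/slow", 0.2),
    ("http://example.com/fast", 0.01),
]))
```

The slow page is requested first, yet the fast page's Response arrives first, because a single thread overlaps the waiting instead of blocking on each download in turn.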
The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).

Another middleware appears, again giving developers plenty of room to customize.
The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine.

Each Spider handles pages one at a time: when it finishes one, it constructs further Request events, which start the crawl of the next pages.
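
The quoted step says the Spider returns both scraped items and new Requests. A Python generator is a natural fit for that. The callback below is a simplified, hypothetical `parse` function that uses regular expressions in place of a real HTML parser, with toy `Request` and `Response` classes.

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    body: str


def parse(response):
    """Hypothetical spider callback: yield one item, then a Request per link found."""
    title = re.search(r"<title>(.*?)</title>", response.body)
    yield {"url": response.url, "title": title.group(1) if title else None}
    for href in re.findall(r'href="([^"]+)"', response.body):
        yield Request(href)


page = Response(
    "http://example.com/",
    '<title>Home</title><a href="http://example.com/next">next</a>',
)
results = list(parse(page))
```

Here `results` holds one dictionary item followed by one `Request` for the link, so a single callback can feed both the storage side and the crawl frontier.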
The Engine passes scraped items and new Requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed Requests to the Scheduler.

The Engine acts as the event dispatcher.
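
Dispatching can be as simple as routing each result by its type. This sketch assumes that anything that is not a `Request` counts as an item. `dispatch` and the one-stage pipeline are hypothetical, and the Spider Middleware pass is left out for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    url: str


def dispatch(spider_output, pipeline, scheduler_queue):
    """Route each result by type: Requests are scheduled, everything else is an item."""
    for result in spider_output:
        if isinstance(result, Request):
            scheduler_queue.append(result)
        else:
            pipeline(result)


stored_items = []
queue = deque()
dispatch(
    [{"title": "Home"}, Request("http://example.com/next")],
    pipeline=stored_items.append,  # hypothetical one-stage pipeline
    scheduler_queue=queue,
)
```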
The process repeats (from step 1) until there are no more requests from the Scheduler.

The cycle keeps running without interruption.
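
Putting the steps together, the whole cycle can be modelled as one loop that ends when the queue is empty. This is a toy model under stated assumptions: `SITE` is a made-up in-memory "web" that replaces real downloads, and `run_crawl` is a hypothetical function that compresses the engine, scheduler, downloader, and spider into a few lines.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}


def run_crawl(start_urls):
    """Repeat the schedule -> download -> parse cycle until the queue is empty."""
    queue = deque(start_urls)
    seen = set(start_urls)
    items = []
    while queue:                       # stop when the scheduler has nothing left
        url = queue.popleft()          # engine asks the scheduler for the next URL
        links = SITE.get(url, [])      # "download" the page
        items.append({"url": url, "links": len(links)})  # spider emits an item
        for link in links:             # spider emits follow-up requests
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


items = run_crawl(["http://example.com/"])
```

The crawl visits three pages and then stops by itself, because no page yields a URL that has not already been scheduled.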
