Crawler framework: Scrapy


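I Introduction

    Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more broadly: for data mining, monitoring, and automated testing, as well as for retrieving data returned by APIs (for example Amazon Associates Web Services) and for general-purpose web crawling.

    Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it achieves concurrency with non-blocking (asynchronous) code. The overall data flow is as follows.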

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.
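
The loop described above can be sketched in plain Python. This is only a toy model, not Scrapy's implementation: `FAKE_SITE`, `download`, `parse`, and `run_engine` are hypothetical names, the "downloader" reads from a dictionary instead of the network, and middlewares are left out.

```python
from collections import deque

# Hypothetical stand-in for a website: page name -> title and outgoing links.
FAKE_SITE = {
    "page1": {"title": "Page 1", "links": ["page2", "page3"]},
    "page2": {"title": "Page 2", "links": ["page3"]},
    "page3": {"title": "Page 3", "links": []},
}

def download(url):
    """Downloader: turn a request (here just a page name) into a response."""
    return {"url": url, **FAKE_SITE[url]}

def parse(response):
    """Spider callback: yield one item, then the follow-up requests."""
    yield {"item": response["title"]}
    for link in response["links"]:
        yield {"request": link}

def run_engine(start_urls):
    scheduler = deque(start_urls)   # steps 1-2: initial requests are scheduled
    seen = set(start_urls)          # the scheduler also drops duplicates
    items = []                      # stands in for the item pipelines
    while scheduler:                # step 9: repeat until no requests remain
        url = scheduler.popleft()   # step 3: next request from the scheduler
        response = download(url)    # steps 4-5: the downloader builds a response
        for result in parse(response):              # steps 6-7: spider output
            if "item" in result:
                items.append(result["item"])        # step 8: items to pipelines
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])  # step 8: requests go back
    return items

print(run_engine(["page1"]))
```

Everything here runs sequentially; in Scrapy the same hand-offs happen asynchronously on top of Twisted.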

 

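Components:

  1. Engine (ENGINE)

    The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow section above for details.

  2. Scheduler (SCHEDULER)
    Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for them. Think of it as a priority queue of URLs: it decides which URL is crawled next and drops duplicate URLs.
  3. Downloader (DOWNLOADER)
    Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.
  4. Spiders (SPIDERS)
    SPIDERS are classes written by the developer to parse responses and extract items, or to issue new requests.
  5. Item pipelines (ITEM PIPELINES)
    Process items after they have been extracted, mainly cleaning, validation, and persistence (for example, saving to a database).
  6. Downloader middlewares (Downloader Middlewares)
    Sit between the Scrapy engine and the downloader. They process requests passed from the ENGINE to the DOWNLOADER and responses passed from the DOWNLOADER back to the ENGINE. You can use this middleware to:
    1. process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
    2. change received response before passing it to a spider;
    3. send a new Request instead of passing received response to a spider;
    4. pass response to a spider without fetching a web page;
    5. silently drop some requests.
  7. Spider middlewares (Spider Middlewares)
    Sit between the ENGINE and the SPIDERS. They process the SPIDERS' input (responses) and output (items and new requests).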
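
To make the "priority queue of URLs that drops duplicates" description of the scheduler concrete, here is a minimal sketch. `ToyScheduler` is a hypothetical class, not Scrapy's scheduler; it assumes, as Scrapy does for `Request.priority`, that a larger priority value is served first.

```python
import heapq
import itertools

class ToyScheduler:
    """Hypothetical scheduler: a priority queue of URLs with duplicate removal."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority: larger values pop first
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

scheduler = ToyScheduler()
scheduler.enqueue("http://www.example.com/1.html")
scheduler.enqueue("http://www.example.com/2.html", priority=10)
duplicate_accepted = scheduler.enqueue("http://www.example.com/1.html")
```

The second URL is returned first because of its higher priority, and the repeated first URL is rejected.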

Official documentation: https://docs.scrapy.org/en/latest/topics/architecture.html

II Installation

#Windows
    1. pip3 install wheel #enables installing packages from wheel files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. Download the Twisted wheel file: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy
  
#Linux
    1. pip3 install scrapy

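III Command-line tool

#1 Getting help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: project-only commands must be run from inside a project directory, global commands can be run anywhere
    Global commands:
        startproject #create a project
        genspider    #create a spider
        settings     #show settings; inside a project directory it shows that project's settings
        runspider    #run a spider from a standalone Python file, no project required
        shell        #scrapy shell <url>  interactive debugging, e.g. to check whether a selector rule is correct
        fetch        #fetch a single page on its own, independent of any project; --headers prints the headers
        view         #download the page and open it in a browser, which shows which data is loaded via AJAX
        version      #scrapy version shows the Scrapy version, scrapy version -v also shows the versions of its dependencies
    Project-only commands:
        crawl        #run a spider; requires a project. Make sure ROBOTSTXT_OBEY = False in the settings file if robots.txt rules would otherwise block the crawl
        check        #run the contract checks defined on the project's spiders
        list         #list the names of the spiders in the project
        edit         #open a spider in an editor; rarely used
        parse        #scrapy parse <url> --callback <callback>  verifies that a callback behaves as expected
        bench        #scrapy bench runs a quick benchmark

#3 Official documentation
    https://docs.scrapy.org/en/latest/topics/commands.html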

#1. Running global commands: make sure you are not inside a project directory, so that no project's settings interfere
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX #inside a project directory this shows that project's setting

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)
    
scrapy view https://www.taobao.com #if the page looks incomplete, the missing parts are loaded via AJAX; a quick way to locate the problem

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version #Scrapy version

scrapy version -v #versions of the dependencies


#2. Running project commands: change into the project directory first
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench
    

Example usage

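IV Project structure and a quick tour of a crawler project

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py

File descriptions:

  • scrapy.cfg  The project's main configuration, used when deploying Scrapy. Crawler-related settings live in settings.py.
  • items.py    Defines the templates used to structure scraped data, similar to a Django Model.
  • pipelines   Data-processing behavior, typically persisting the structured data.
  • settings.py Configuration such as crawl depth, concurrency, and download delay. Note: option names must be uppercase or they are ignored, e.g. USER_AGENT='xxxx'
  • spiders     The spiders directory, where you create files and write crawling rules.

Note: spider files are usually named after the site's domain.

#create entrypoint.py in the project directory
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])

By default a spider can only be run from the command line; to run it from PyCharm, use an entry script like the one above.

import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

About Windows console encoding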
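V Spiders

1. Introduction

#1. Spiders are classes that define how a site (or a group of sites) is crawled: how to perform the crawl and how to extract structured data from the pages.

#2. In other words, a Spider is where you customize the crawling and parsing behavior for a particular site or group of sites.

2. The Spider loop

#1. Generate the initial Requests for the first URLs and attach a callback
The first requests are produced by start_requests(), which by default builds a Request for each URL in the start_urls list, with parse as the default callback. The callback fires automatically once the download finishes and a response comes back.

#2. In the callback, parse the response and return a value
The return value can be one of four things:
        a dict containing the parsed data
        an Item object
        a new Request object (new Requests also need a callback)
        an iterable containing Items and/or Requests

#3. Parse the page content inside the callback
Scrapy's built-in Selectors are the usual choice, but you can just as well use BeautifulSoup, lxml, or whatever you prefer.

#4. Finally, the returned Items are persisted
either to a database through the Item Pipeline component (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
or exported to files through Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

3. Scrapy provides five Spider classes:

#1. scrapy.spiders.Spider #scrapy.Spider is the same as scrapy.spiders.Spider
#2. scrapy.spiders.CrawlSpider
#3. scrapy.spiders.XMLFeedSpider
#4. scrapy.spiders.CSVFeedSpider
#5. scrapy.spiders.SitemapSpider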
4. Importing and using them

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): #custom class that inherits from a base class provided by Scrapy
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['https://www.amazon.cn/']
    
    def parse(self, response):
        pass

5. class scrapy.spiders.Spider

This is the simplest spider class; every other spider class inherits from it (including the ones you write yourself).

It provides no special functionality, only a default start_requests method that reads the URLs in start_urls, sends a request for each, and uses parse as the default callback.

class AmazonSpider(scrapy.Spider):
    name = 'amazon' 
    
    allowed_domains = ['www.amazon.cn'] 
    
    start_urls = ['https://www.amazon.cn/']
    
    custom_settings = {
        'BOT_NAME' : 'Egon_Spider_Amazon',
        'DEFAULT_REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }
    
    def parse(self, response):
        pass

#1. name = 'amazon' 
The spider's name, which Scrapy uses to locate the spider,
so it is required and must be unique (In Python 2 this must be ASCII only.)

#2. allowed_domains = ['www.amazon.cn'] 
The domains the spider may crawl. If OffsiteMiddleware is enabled (it is by default),
requests to domains that are neither in this list nor subdomains of its entries are not followed.
If the target URL is https://www.example.com/1.html, add 'example.com' to the list.

#3. start_urls = ['https://www.amazon.cn/']
If start_requests() is not overridden, the first requests are generated from the URLs in this list.

#4. custom_settings
A dict of settings that override the project-level settings when this spider runs.
It must be defined as a class attribute, because the settings are loaded before the class is instantiated.

#5. settings
self.settings['SETTING_NAME'] gives access to the values from settings.py; anything you define in custom_settings takes precedence.

#6. logger
The logger is named after the spider by default
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#7. crawler (for reference)
This attribute is set by the from_crawler class method.

#8. from_crawler(crawler, *args, **kwargs) (for reference)
You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#9. start_requests()
Issues the first Requests and must return an iterable. Scrapy calls it exactly once, when the spider is opened.
By default it generates Request(url, dont_filter=True) for each URL in start_urls.

#For the dont_filter parameter, see the custom deduplication rules below

To change the initial Requests, override this method. For example, to start with a POST request:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
        
#10. parse(response)
The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.

#11. log(message[, level, component]) (for reference)
Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.

#12. closed(reason)
Called automatically when the spider finishes.

Customizing scrapy.Spider: attributes and methods in detail

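Deduplication rules should be shared by all spiders: once one spider has crawled a URL, the others should skip it. This can be done as follows.

#Method 1:
1. Add a class attribute
visited=set() #class attribute

2. In the parse callback:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url) 

#Method 1, improved: URLs can be long, so store a hash of the URL instead
#(requires "import hashlib" at the top of the file)
def parse(self, response):
    url = hashlib.md5(response.request.url.encode('utf-8')).hexdigest()
    if url in self.visited:
        return None
    .......

    self.visited.add(url) 

#Method 2: Scrapy's built-in deduplication
Settings file:
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter' #the default filter; seen requests are kept in memory
DUPEFILTER_DEBUG = False
JOBDIR = "directory for the record of visited requests, e.g. /root/"  # final path is /root/requests.seen; seen requests are stored in this file

Scrapy's default deduplication filter is RFPDupeFilter. All we need to do is pass
Request(...,dont_filter=False), which is the default; dont_filter=True tells Scrapy that this request is exempt from deduplication.

#Method 3:
We can also write a custom filter modeled on RFPDupeFilter.

from scrapy.dupefilters import RFPDupeFilter  #read the source and follow BaseDupeFilter

#Step 1: create a custom filter file dup.py in the project directory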
class UrlFilter(object):
    def __init__(self):
        self.visited = set() #or keep them in a database

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False  # explicit: this request has not been seen before

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

#Step 2: settings.py:
DUPEFILTER_CLASS = 'project_name.dup.UrlFilter'


# Source walkthrough:
from scrapy.core.scheduler import Scheduler
See the enqueue_request method of Scheduler: self.df.request_seen(request)

Deduplication rules: removing duplicate URLs
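
The "store a hash instead of the URL" idea can be tried outside Scrapy with a few lines of standard-library code. `HashedUrlFilter` is a hypothetical class that takes plain URL strings rather than Request objects, and SHA-1 is used here only as an example digest; the text above uses MD5.

```python
import hashlib

class HashedUrlFilter:
    """Hypothetical filter that remembers a fixed-size digest of each URL."""

    def __init__(self):
        self.visited = set()

    @staticmethod
    def fingerprint(url):
        # hashlib works on bytes, so the URL has to be encoded first
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def request_seen(self, url):
        """Return True for a repeat visit, False (and remember it) otherwise."""
        fp = self.fingerprint(url)
        if fp in self.visited:
            return True
        self.visited.add(fp)
        return False

url_filter = HashedUrlFilter()
first_visit = url_filter.request_seen("http://www.example.com/1.html")
second_visit = url_filter.request_seen("http://www.example.com/1.html")
```

Every stored entry has the same length no matter how long the URL is, which is the point of hashing.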

#Example 1:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        
    
#Example 2: one callback returning several Requests and Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            #hrefs may be relative, so resolve them against the page URL
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
            
            
#Example 3: set the initial URLs directly in start_requests(); start_urls is then unused

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            #hrefs may be relative, so resolve them against the page URL
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

Examples

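We may need to pass arguments to a spider from the command line, for example the initial URL, like this
#on the command line
scrapy crawl myspider -a category=electronics

#arguments passed in from outside are received in __init__
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...

        
#Note: every argument arrives as a string; if you need structured data, parse it with something like json.loads

Passing arguments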
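
Because every `-a` value arrives as a string, conversion has to happen explicitly. The sketch below shows one way to do it; `parse_spider_args` and the `pages` and `filters` arguments are hypothetical and simply mimic the keyword arguments a spider's `__init__` would receive.

```python
import json

def parse_spider_args(category=None, pages="1", filters="{}"):
    """Convert string-valued spider arguments into typed values."""
    return {
        "category": category,             # already the string we want
        "pages": int(pages),              # "3" -> 3
        "filters": json.loads(filters),   # JSON text -> dict
    }

# Roughly what would arrive from:
#   scrapy crawl myspider -a category=electronics -a pages=3 -a filters='{"brand": "acme", "max_price": 100}'
args = parse_spider_args(
    category="electronics",
    pages="3",
    filters='{"brand": "acme", "max_price": 100}',
)
```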

6. Other generic Spiders: https://docs.scrapy.org/en/latest/topics/spiders.html#generic-spiders

VI Selectors

#1 // versus /
#2 text
#3 extract and extract_first: pull the content out of selector objects
#4 Attributes: prefix the attribute name with @ in XPath
#5 Nested lookups
#6 Default values
#7 Lookup by attribute
#8 Fuzzy lookup by attribute
#9 Regular expressions
#10 Relative XPath
#11 XPath with variables

response.selector.css()
response.selector.xpath()
can be shortened to
response.css()
response.xpath()

#1 // versus /
response.xpath('//body/a')
response.css('div a::text')

>>> response.xpath('//body/a') #the leading // searches the whole document; the / after body selects direct children of body
[]
>>> response.xpath('//body//a') #the leading // searches the whole document; the // after body selects all descendants of body
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3 extract and extract_first: pull the content out of selector objects
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4 Attributes: prefix the attribute name with @ in XPath
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#5 Nested lookups
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#6 Default values
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#7 Lookup by attribute
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[href="image3.html"]::text').extract()

#8 Fuzzy lookup by attribute
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#9 Regular expressions
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#10 Relative XPath
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') #this scans from the document root again
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#11 XPath with variables
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #the id of the div that contains 5 a tags
'images'
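
The difference between `/` (direct children) and `//` (all descendants) can be reproduced without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in, assuming a small well-formed document; ElementTree supports only a subset of XPath and is not what Scrapy's selectors are built on, so treat it purely as an illustration of the two axes.

```python
import xml.etree.ElementTree as ET

# A reduced, well-formed version of the sample page used above (an assumption).
DOC = """
<html>
  <body>
    <div id="images">
      <a href="image1.html">Name: My image 1 </a>
      <a href="image2.html">Name: My image 2 </a>
      <a href="image3.html">Name: My image 3 </a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element

# "/" only looks at direct children: no <a> sits directly under <body>
children = root.findall("./body/a")

# "//" looks at every level below <body>, so the links inside <div> match
descendants = root.findall("./body//a")

# attribute predicates use the same [@name='value'] form as in XPath
by_attr = root.findall(".//a[@href='image3.html']")

hrefs = [a.get("href") for a in descendants]
```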

https://docs.scrapy.org/en/latest/topics/selectors.html

VII Items

https://docs.scrapy.org/en/latest/topics/items.html

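VIII Item Pipeline

#1: You can write several Pipeline classes
#1. Whatever the higher-priority Pipeline's process_item returns (a value or None) is passed automatically to the next pipeline's process_item
#2. To let only the first Pipeline run, have its process_item raise DropItem()

#3. Use spider.name == 'spider_name' to control which spiders use which pipelines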
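
The three rules above can be simulated in a few lines. This is a toy model rather than Scrapy's pipeline manager: `DropItem` is a local stand-in for `scrapy.exceptions.DropItem`, and `PricePipeline`, `TagPipeline`, and `run_pipelines` are hypothetical names.

```python
from types import SimpleNamespace

class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class PricePipeline:
    def process_item(self, item, spider):
        if item.get("price") is None:
            raise DropItem("missing price")   # rule 2: later pipelines are skipped
        return item                           # rule 1: the return value moves on

class TagPipeline:
    def process_item(self, item, spider):
        if spider.name == "amazon":           # rule 3: act only for one spider
            item = dict(item, source="amazon")
        return item

def run_pipelines(item, spider, pipelines):
    """pipelines maps each pipeline object to its order; lower numbers run first."""
    for pipeline in sorted(pipelines, key=pipelines.get):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None                       # the item is discarded
    return item

PIPELINES = {PricePipeline(): 200, TagPipeline(): 300}
amazon = SimpleNamespace(name="amazon")
other = SimpleNamespace(name="other")

kept = run_pipelines({"price": 10}, amazon, PIPELINES)
dropped = run_pipelines({"title": "no price"}, amazon, PIPELINES)
untagged = run_pipelines({"price": 10}, other, PIPELINES)
```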

#2: Example
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    @classmethod
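    def from_crawler(cls, crawler):
        """
        Scrapy first uses getattr to check whether we defined from_crawler;
        if so, it calls it to create the instance
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        Runs once when the spider starts
        """
        print('000000')

    def close_spider(self,spider):
        """
        Runs once when the spider closes
        """
        print('111111')


    def process_item(self, item, spider):
        # process the item and persist it

        # returning the item lets the following pipelines keep processing it
        return item

        # raising DropItem discards the item; later pipelines never see it
        # raise DropItem()

A custom pipeline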

#1、settings.py
HOST="127.0.0.1"
PORT=27017
USER="root"
PWD="123"
DB="amazon"
TABLE="goods"



ITEM_PIPELINES = {
   'Amazon.pipelines.CustomPipeline': 200,
}

#2. pipelines.py
from pymongo import MongoClient

class CustomPipeline(object):
    def __init__(self,host,port,user,pwd,db,table):
        self.host=host
        self.port=port
        self.user=user
        self.pwd=pwd
        self.db=db
        self.table=table

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy first uses getattr to check whether we defined from_crawler;
        if so, it calls it to create the instance
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST,PORT,USER,PWD,DB,TABLE)

    def open_spider(self,spider):
        """
        Runs once when the spider starts
        """
        self.client = MongoClient('mongodb://%s:%s@%s:%s' %(self.user,self.pwd,self.host,self.port))

    def close_spider(self,spider):
        """
        Runs once when the spider closes
        """
        self.client.close()


    def process_item(self, item, spider):
        # process the item and persist it
        # insert_one replaces Collection.save(), which PyMongo 4 no longer provides
        self.client[self.db][self.table].insert_one(dict(item))
        # return the item so that later pipelines still receive it
        return item

Example

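https://docs.scrapy.org/en/latest/topics/item-pipeline.html

IX Downloader Middleware

What downloader middleware is for
    1. Performing the download yourself inside process_request, bypassing Scrapy's downloader
    2. Post-processing requests, for example
        setting request headers
        setting cookies
        adding a proxy
            Scrapy's built-in proxy component:
                from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
                from urllib.request import getproxies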

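class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that needs to be downloaded, through each downloader middleware's process_request
        :param request: 
        :param spider: 
        :return:  
            None: continue with the remaining middlewares and the download;
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass



    def process_response(self, request, response, spider):
        """
        Called on the way back, once the downloader has produced a response
        :param request:
        :param response:
        :param spider:
        :return: 
            Response object: handed to the next middleware's process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or a process_request() (from a downloader middleware) raises an exception
        :param request:
        :param exception:
        :param spider:
        :return: 
            None: pass the exception on to the remaining middlewares;
            Response object: stop calling further process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None

Downloader middleware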
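
The return-value rules in the docstrings above imply a particular call order. The sketch below models two of the cases (returning None and returning a Response from process_request) in plain Python. It is a simplification under stated assumptions: `run_downloader_chain`, `PassThrough`, and `CacheHit` are hypothetical names, requests and responses are plain strings, and the Request and exception branches are left out.

```python
class PassThrough:
    """Hypothetical middleware that changes nothing."""
    def __init__(self, name):
        self.name = name

    def process_request(self, request):
        return None                      # None: keep going towards the downloader

    def process_response(self, request, response):
        return response

class CacheHit(PassThrough):
    """Hypothetical middleware that answers from a cache instead of downloading."""
    def process_request(self, request):
        return "cached:" + request

def run_downloader_chain(request, middlewares, download):
    log = []
    response = None
    for mw in middlewares:                       # engine -> downloader
        log.append(mw.name + ".process_request")
        response = mw.process_request(request)
        if response is not None:                 # a response short-circuits the rest
            break
    if response is None:
        log.append("download")
        response = download(request)
    for mw in reversed(middlewares):             # downloader -> engine
        log.append(mw.name + ".process_response")
        response = mw.process_response(request, response)
    return response, log

fetch = lambda request: "fetched:" + request
normal = run_downloader_chain("url", [PassThrough("mw1"), PassThrough("mw2")], fetch)
cached = run_downloader_chain("url", [CacheHit("mw1"), PassThrough("mw2")], fetch)
```

In the cached case the second middleware's process_request and the download itself never run, but every process_response still does.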

#1. Create proxy_handle.py in the same directory as middlewares.py
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
    
    

#2. middlewares.py
from Amazon.proxy_handle import get_proxy,delete_proxy

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that needs to be downloaded, through each downloader middleware's process_request
        :param request:
        :param spider:
        :return:
            None: continue with the remaining middlewares and the download;
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        proxy="http://" + get_proxy()
        request.meta['download_timeout']=20
        request.meta["proxy"] = proxy
        print('Adding proxy %s for %s ' % (proxy, request.url),end='')
        print('meta is',request.meta)

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the downloader has produced a response
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed to the next middleware's process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('Response status code',response.status)
        return response


    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or a process_request() (from a downloader middleware) raises an exception
        :param request:
        :param exception:
        :param spider:
        :return:
            None: pass the exception on to the remaining middlewares;
            Response object: stop calling further process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        print('Proxy %s failed while fetching %s: %s' %(request.meta['proxy'],request.url,exception))
        import time
        time.sleep(5)  # note: this blocks Scrapy's event loop for the whole crawler
        delete_proxy(request.meta['proxy'].split("//")[-1])
        request.meta['proxy']='http://'+get_proxy()

        return request

Configuring a proxy

X Spider Middleware

1. Spider middleware methods

from scrapy import signals

class SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) #spider_opened fires when this spider starts
        return s

    def spider_opened(self, spider):
        # spider.logger.info('Spider 1 sent by egon: %s' % spider.name)
        print('Spider 1 sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
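        # (that is, on the way in, before the spider's callback runs)

        # Return value: Should return None or raise an exception.
        #1. None: continue with the other middlewares' process_spider_input
        #2. Raising an exception:
        # once an exception is raised, no further process_spider_input methods run
        # and the errback bound to the request is triggered
        # the errback's return value is passed back through the middlewares' process_spider_output in reverse order
        # if no errback is found, the middlewares' process_spider_exception methods run in reverse order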

        print("input1")
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        print('output1')

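        # yielding the results one by one is equivalent to returning them all at once
        # be careful with generators: a function containing yield returns a generator and does not run immediately, which can mislead you about the order in which middlewares execute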
        # for i in result:
        #     yield i
        return result

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        print('exception1')

Spider middleware

2. When the spider starts and the initial requests are produced

#Step 1:
'''
Uncomment:
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}

'''


#Step 2: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) #spider_opened fires when this spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 1 sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r


        
        
class SpiderMiddleware2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # spider_opened fires when this spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 2 sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests2')
        for r in start_requests:
            yield r


class SpiderMiddleware3(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # spider_opened fires when this spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 3 sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests3')
        for r in start_requests:
            yield r


#Step 3: analysis of the output
#1. Printed immediately when the spider starts:

Spider 1 sent by egon: baidu
Spider 2 sent by egon: baidu
Spider 3 sent by egon: baidu


#2. Then an initial request is produced and passes through spider middlewares 1, 2, 3 in turn:
start_requests1
start_requests2
start_requests3

3. When process_spider_input returns None

#Step 1: uncomment:
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}

#Step 2: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        return None

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')


class SpiderMiddleware3(object):

    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')


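#Step 3: analysis of the output

#1. On its way to the spider, the response passes through spider middlewares 1, 2, 3 in turn
input1
input2
input3

#2. After the spider has processed it, the results pass through spider middlewares 3, 2, 1 in turn
output3
output2
output1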

4. When process_spider_input raises an exception

#Step 1:
'''
Uncomment:
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}

'''

#Step 2: middlewares.py

from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')


class SpiderMiddleware3(object):

    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

        

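#Output        
input1
input2
exception3
exception2
exception1

#Analysis:
#1. Middleware 1's process_spider_input returns None, so the response is handed on to middleware 2's process_spider_input
#2. Middleware 2's process_spider_input raises an exception, so the remaining process_spider_input methods are skipped and the failure is passed to the errback of that request in the spider
#3. No errback is found, so the response is handled neither by the spider's normal callback nor by an errback; the spider does nothing, and the process_spider_exception methods start running in reverse order
#4. If a process_spider_exception returns None, it declines to handle the exception and passes it to the next process_spider_exception; if all of them return None, the exception is finally raised by the Engine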
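
The call order described in this analysis can be traced with a small model. It covers only the cases discussed in this section and is not Scrapy's implementation: `Middleware` and `trace` are hypothetical names, and every process_spider_exception is assumed to return None.

```python
class Middleware:
    """Hypothetical spider middleware; fail=True makes its input hook raise."""
    def __init__(self, n, fail=False):
        self.n = n
        self.fail = fail

    def process_spider_input(self, response):
        if self.fail:
            raise TypeError("input%d raised an exception" % self.n)

def trace(middlewares, errback=None):
    """Return the order in which the middleware hooks would be called."""
    log = []
    failure = None
    for mw in middlewares:                     # inputs run in ascending order
        log.append("input%d" % mw.n)
        try:
            mw.process_spider_input("response")
        except TypeError as exc:
            failure = exc
            break                              # later inputs are skipped
    if failure is not None and errback is None:
        # nobody handles the failure: exception hooks run in reverse order
        log += ["exception%d" % mw.n for mw in reversed(middlewares)]
        return log
    if failure is not None:
        errback(failure)                       # the errback absorbs the failure
    # callback or errback produced a result: outputs run in reverse order
    log += ["output%d" % mw.n for mw in reversed(middlewares)]
    return log

all_ok = trace([Middleware(1), Middleware(2), Middleware(3)])
no_errback = trace([Middleware(1), Middleware(2, fail=True), Middleware(3)])
with_errback = trace([Middleware(1), Middleware(2, fail=True), Middleware(3)],
                     errback=lambda failure: [1, 2, 3, 4, 5])
```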


5. Specifying an errback

#Step 1: spider.py
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/']


    def start_requests(self):
        yield scrapy.Request(url='https://www.baidu.com/',
                             callback=self.parse,
                             errback=self.parse_err,
                             )

    def parse(self, response):
        pass

    def parse_err(self,res):
        #res is the failure; because this function has handled it, it is not raised any further, and processing moves on to process_spider_output
        return [1,2,3,4,5] #extract the useful data from the failure and return it as an iterable, to be picked up by process_spider_output



#Step 2:
'''
Uncomment:
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}

'''

#Step 3: middlewares.py

from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1',list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2',list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')


class SpiderMiddleware3(object):

    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3',list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')



#Step 4: analysis of the output
input1
input2
output3 [1, 2, 3, 4, 5] #parse_err's return value can be consumed only once; inside output3 you could build a new request from the failure information
output2 []
output1 []
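
The empty lists for output2 and output1 come from draining an iterator with list() and then passing the same, now exhausted, iterator on. The plain-Python sketch below reproduces the effect. It assumes the spider's result reaches the middlewares as a one-shot iterator, which the output above suggests; `errback_result`, `draining_output`, and `careful_output` are hypothetical names.

```python
def errback_result():
    """Stands in for the errback's return value, seen as a one-shot iterator."""
    yield from [1, 2, 3, 4, 5]

def draining_output(name, result, log):
    log.append((name, list(result)))   # list() consumes the iterator...
    return result                      # ...so the next middleware gets nothing

def careful_output(name, result, log):
    items = list(result)               # materialize once
    log.append((name, items))
    return items                       # pass the list on instead

drained_log = []
result = errback_result()
for name in ("output3", "output2", "output1"):
    result = draining_output(name, result, drained_log)

careful_log = []
result = errback_result()
for name in ("output3", "output2", "output1"):
    result = careful_output(name, result, careful_log)
```

With `careful_output` every middleware sees all five values, at the cost of holding the whole result in memory.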


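XI Custom extensions

Custom extensions (similar to Django's signals)
    1. Django's signals are extension points reserved by Django; once a signal fires, the connected functionality runs
    2. The advantage of a custom Scrapy extension is that it can add functionality at any point we choose, whereas the hooks offered by the other components only run at fixed points

#1. Create a file next to settings.py, for example extentions.py, with the following content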
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        obj = cls(val)

        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)

        return obj

    def spider_opened(self, spider):
        print('=============>open')

    def spider_closed(self, spider):
        print('=============>close')

# 2. Enable the extension in settings.py
EXTENSIONS = {
    "Amazon.extentions.MyExtension":200
}
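
To make the connect/trigger relationship concrete, here is a toy model of signal dispatch in plain Python. It is only a sketch of the pattern: ToySignalManager, ToyCrawler and the string signal names are hypothetical stand-ins, not Scrapy's own classes, and MMMM is given the value 7 purely for illustration.

```python
from collections import defaultdict


class ToySignalManager:
    """Hypothetical stand-in for crawler.signals: maps a signal to its handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, handler, signal):
        self._handlers[signal].append(handler)

    def send(self, signal, **kwargs):
        # Call every handler connected to this signal, in connection order.
        return [handler(**kwargs) for handler in self._handlers[signal]]


class ToyCrawler:
    """Hypothetical stand-in for the crawler object passed to from_crawler."""

    def __init__(self, settings):
        self.settings = settings
        self.signals = ToySignalManager()


class MyExtension:
    def __init__(self, value):
        self.value = value
        self.events = []

    @classmethod
    def from_crawler(cls, crawler):
        # Read a setting, build the extension, then register its handlers.
        obj = cls(int(crawler.settings.get('MMMM', 0)))
        crawler.signals.connect(obj.spider_opened, signal='spider_opened')
        crawler.signals.connect(obj.spider_closed, signal='spider_closed')
        return obj

    def spider_opened(self, spider):
        self.events.append(('open', spider))

    def spider_closed(self, spider):
        self.events.append(('close', spider))


crawler = ToyCrawler({'MMMM': 7})
ext = MyExtension.from_crawler(crawler)
crawler.signals.send('spider_opened', spider='amazon')
crawler.signals.send('spider_closed', spider='amazon')
print(ext.value, ext.events)
```

The extension never calls its own handlers; it only registers them, and whoever sends the signal decides when they run.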


12 settings.py

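#==> Part 1: basic settings <===
# 1. Project name; the default USER_AGENT is built from it, and it is also used as the logger name
BOT_NAME = 'Amazon'

# 2. Spider module paths
SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders'

# 3. Client User-Agent request header
#USER_AGENT = 'Amazon (+//www.yourdomain.com)'

# 4. Whether to obey robots.txt
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# 5. Whether cookies are enabled (handled through a cookiejar); enabled by default
#COOKIES_ENABLED = False

# 6. The Telnet console lets you inspect and control the running crawler: connect with telnet ip port, then issue commands
#TELNETCONSOLE_ENABLED = False
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]

# 7. Default request headers Scrapy sends with HTTP requests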
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}



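#===> Part 2: concurrency and delay <===
# 1. Maximum number of concurrent requests the downloader handles in total; default 16
#CONCURRENT_REQUESTS = 32

# 2. Maximum number of concurrent requests per domain; default 8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

# 3. Maximum number of concurrent requests per IP; default 0, meaning no limit. Two things to note:
# I. If non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored: the concurrency limit is applied per IP rather than per domain
# II. This setting also affects DOWNLOAD_DELAY: if non-zero, the download delay is enforced per IP rather than per domain
#CONCURRENT_REQUESTS_PER_IP = 16

# 4. If AutoThrottle is not enabled, this is a fixed value: the number of seconds to wait between consecutive requests to the same website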
#DOWNLOAD_DELAY = 3


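#===> Part 3: automatic throttling (AutoThrottle extension) <===
# I. Introduction
from scrapy.extensions.throttle import AutoThrottle #//scrapy.readthedocs.io/en/latest/topics/autothrottle.html#topics-autothrottle
Design goals:
1. Be nicer to sites than the default download delay
2. Automatically adjust Scrapy to the optimum crawling speed, so the user does not have to tune the download delay by hand. The user only defines the maximum number of concurrent requests allowed, and the extension does the rest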


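# II. How it works
In Scrapy, download latency is measured as the time elapsed between establishing the TCP connection and receiving the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment, because Scrapy may be busy processing a spider callback and unable to attend to downloads. Still, these latencies give a reasonable estimate of how busy Scrapy (and even the server) is, and this extension builds on that premise.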


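# III. Throttling algorithm
AutoThrottle adjusts the download delay according to the following rules
# 1. Spiders start with a download delay of AUTOTHROTTLE_START_DELAY
# 2. When a response is received, the target download delay for the site is computed as latency / AUTOTHROTTLE_TARGET_CONCURRENCY
# 3. The download delay for the next requests is set to the average of the previous download delay and the target download delay
# 4. Latencies of non-200 responses are not allowed to decrease the delay
# 5. The download delay cannot become lower than DOWNLOAD_DELAY or higher than AUTOTHROTTLE_MAX_DELAY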
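
The rules above can be sketched as a single update function. This is a simplified illustration rather than Scrapy's source: it reads rule 4 as "a response whose status is not 200 may not lower the delay", and the function name next_download_delay, its parameters, and the sample latencies are all assumptions made for the example.

```python
def next_download_delay(current_delay, latency, status,
                        target_concurrency, min_delay, max_delay):
    """Apply rules 2-5 to one response and return the delay to use next."""
    # Rule 2: the delay that would keep `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Rule 3: average the current delay with the target delay.
    new_delay = (current_delay + target_delay) / 2.0
    # Rule 5: clamp to [min_delay, max_delay].
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Rule 4: a non-200 response must not lower the delay.
    if status != 200 and new_delay < current_delay:
        return current_delay
    return new_delay


delay = 5.0  # rule 1: start from AUTOTHROTTLE_START_DELAY
for latency, status in [(2.0, 200), (0.5, 503), (40.0, 200)]:
    delay = next_download_delay(delay, latency, status,
                                target_concurrency=1.0,
                                min_delay=3.0, max_delay=10.0)
    print(delay)
```

With these sample values the delay drops to 3.5 after the fast 200 response, stays at 3.5 after the fast 503 response, and is capped at 10.0 after the slow response.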

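# IV. Configuration
# Set to True to enable AutoThrottle; default False
AUTOTHROTTLE_ENABLED = True
# Initial download delay
AUTOTHROTTLE_START_DELAY = 5
# Minimum delay
DOWNLOAD_DELAY = 3
# Maximum delay
AUTOTHROTTLE_MAX_DELAY = 10
# Average number of requests Scrapy should send in parallel to each remote site; it should not exceed CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP. Raising it increases throughput and the load on the target site; lowering it makes the crawler more "polite" to the target site
# At any given moment the actual number of concurrent requests may be higher or lower than this value; it is a target the crawler tries to approach, not a hard limit
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
# Debugging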
AUTOTHROTTLE_DEBUG = True
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16



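#===> Part 4: crawl depth and crawl order <===
# 1. Maximum depth the crawler is allowed to reach; the current depth is available as response.meta['depth']; 0 means no limit
# DEPTH_LIMIT = 3

# 2. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO

# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

# First in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'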
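
Why a LIFO queue gives depth-first order and a FIFO queue gives breadth-first order can be seen on a tiny example. The sketch below uses a plain collections.deque instead of Scrapy's queue classes; the link graph LINKS and the function crawl_order are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [], 'E': [], 'F': [],
}


def crawl_order(start, lifo):
    """Return the order in which pages are fetched from a LIFO or FIFO queue."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        # LIFO pops the newest request (depth-first);
        # FIFO pops the oldest request (breadth-first).
        page = queue.pop() if lifo else queue.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl_order('A', lifo=True))   # ['A', 'C', 'F', 'B', 'E', 'D']
print(crawl_order('A', lifo=False))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The LIFO run follows one branch to its end before backtracking, while the FIFO run finishes each depth level before moving to the next.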


# 3. Scheduler class
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 4. Duplicate-URL filter
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
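
DUPEFILTER_CLASS points at a class that decides whether a request has already been seen. The source only gives the class path, so the body below is my assumption of what such a class might look like; its method names are modeled on Scrapy's BaseDupeFilter (from_settings, request_seen, open, close, log). FakeRequest is a hypothetical stand-in so the sketch can run without Scrapy.

```python
class RepeatUrl:
    """Sketch of a URL-based duplicate filter (the body is illustrative)."""

    def __init__(self):
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # Return True to signal that the request is a duplicate and should be dropped.
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False

    def open(self):
        # Hook for setup when the crawl starts; nothing to do here.
        pass

    def close(self, reason):
        # Hook for cleanup when the crawl ends; nothing to do here.
        pass

    def log(self, request, spider):
        # Hook for reporting filtered requests; nothing to do here.
        pass


class FakeRequest:
    """Hypothetical stand-in for a request; only the url attribute is used."""

    def __init__(self, url):
        self.url = url


dupefilter = RepeatUrl.from_settings({})
results = [dupefilter.request_seen(FakeRequest(u))
           for u in ['http://a.example/1', 'http://a.example/2', 'http://a.example/1']]
print(results)  # [False, False, True]
```

Comparing raw URL strings is a simplification: two URLs that differ only in query-parameter order would be treated as different requests.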



#===> Part 5: middlewares, pipelines, extensions <===
# 1. Enable or disable spider middlewares
# See //scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#}

# 2. Enable or disable downloader middlewares
# See //scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'Amazon.middlewares.DownMiddleware1': 543,
}

# 3. Enable or disable extensions
# See //scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# 4. Configure item pipelines
# See //scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'Amazon.pipelines.CustomPipeline': 200,
}



#===> Part 6: HTTP cache <===
"""
1. Enable the cache
    Stores the requests that have been sent and their responses so they can be reused later
    
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
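# Whether to enable the HTTP cache
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; later requests for the same page are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time in seconds (0 means cached entries never expire)
# HTTPCACHE_EXPIRATION_SECS = 0

# Directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes whose responses are not cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend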
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
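
A small sketch of how HTTPCACHE_EXPIRATION_SECS can be interpreted, assuming that 0 means cached entries never expire and that an entry's age is measured from the time it was stored. The function is_expired and its parameters are hypothetical names chosen for illustration, not Scrapy functions.

```python
import time


def is_expired(stored_at, expiration_secs, now=None):
    """Return True if a cache entry stored at `stored_at` should be ignored."""
    if now is None:
        now = time.time()
    # 0 disables expiration; otherwise compare the entry's age with the limit.
    return 0 < expiration_secs < now - stored_at


print(is_expired(stored_at=1000.0, expiration_secs=0, now=999999.0))  # False
print(is_expired(stored_at=1000.0, expiration_secs=60, now=1030.0))   # False
print(is_expired(stored_at=1000.0, expiration_secs=60, now=1100.0))   # True
```

Passing now explicitly keeps the check deterministic, which makes it easy to test without waiting for real time to pass.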


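#===> Part 7: thread pool <===
REACTOR_THREADPOOL_MAXSIZE = 10

# Default: 10
# Scrapy is built on the Twisted asynchronous I/O framework. This setting is the maximum limit for the Twisted reactor thread pool size; the pool runs blocking work such as DNS resolution, while the downloads themselves run asynchronously in the reactor

# About the Twisted thread pool:
//twistedmatrix.com/documents/10.1.0/core/howto/threading.html

# Thread pool implementation: twisted.python.threadpool.ThreadPool
Adjusting the thread pool size in Twisted: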
from twisted.internet import reactor
reactor.suggestThreadPoolSize(30)

# Relevant Scrapy source:
D:\python3.6\Lib\site-packages\scrapy\crawler.py

# Additional notes:
Tool for viewing the number of threads in a process on Windows:
    https://docs.microsoft.com/zh-cn/sysinternals/downloads/pslist
    or
    https://pan.baidu.com/s/1jJ0pMaM
    
    Command:
    pslist |findstr python

On Linux: top -p <process id>


#===> Part 8: other default settings for reference <===
D:\python3.6\Lib\site-packages\scrapy\settings\default_settings.py
