Using Scrapy's Built-in ImagesPip

January 19, 2020
Notes

ImagesPipeline is a class that ships with Scrapy for handling images: it downloads the images to local disk while the spider crawls.

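Advantages:

Converts downloaded images to a common format (JPG) and mode (RGB)
Avoids downloading the same image twice
Generates thumbnails
Filters images by size
Downloads asynchronously
...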
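Several of these features are switched on through settings rather than code. The fragment below is a minimal sketch of a settings.py excerpt using setting names from Scrapy's documentation; the concrete numbers are illustrative assumptions, not values from this post.

```python
# settings.py (fragment); the values are illustrative assumptions

# Avoid re-downloading: images fetched within the last 90 days are skipped.
IMAGES_EXPIRES = 90

# Thumbnail generation: one thumbnail per entry, stored under thumbs/<name>/.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Size filtering: images smaller than this in either dimension are dropped.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```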

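Workflow:

1. The spider scrapes an Item and puts the image URLs into its image_urls field.
2. The Item returned by the spider is passed to the Item Pipeline.
3. When the Item reaches the ImagesPipeline, the URLs in image_urls are scheduled and downloaded by Scrapy's scheduler and downloader.
4. Once the images have been downloaded successfully, information such as the download path, the URL, and the checksum is written into the images field.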
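To make the last step concrete, here is a sketch of what an item might look like after the stock ImagesPipeline has processed it. It uses only the standard library; the URL is hypothetical and the checksum is a placeholder. The file name rule (SHA-1 of the URL under full/) mirrors the default behaviour of ImagesPipeline as far as I know, so treat it as an illustration rather than a guarantee for every Scrapy version.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming rule: full/<sha1 of the URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(digest)


url = "http://example.com/cat.jpg"  # hypothetical image URL

# Shape of the item once the download has finished.
item = {
    "image_urls": [url],
    "images": [
        {
            "url": url,
            "path": default_image_path(url),  # relative to IMAGES_STORE
            "checksum": "<md5 of the image body>",  # placeholder, not a real hash
        }
    ],
}

print(item["images"][0]["path"])
```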

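Implementation options:

1. Write a custom pipeline. The advantage is that you can override methods of the ImagesPipeline class, for example to sort the images into categories as needed.
2. Use the ImagesPipeline class directly. This is simple but less flexible: every image is saved under the full folder, so the images cannot be categorized.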
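For comparison, using the class directly needs no pipeline code at all, only settings. A minimal sketch, assuming the images should land in a folder named images (the folder name is an assumption; Pillow must also be installed for ImagesPipeline to work):

```python
# settings.py (fragment) for using the stock pipeline directly

# Register the built-in class; no custom pipeline code is needed.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Root folder for downloads; files end up under <IMAGES_STORE>/full/.
IMAGES_STORE = "images"  # assumed folder name
```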

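Practice: crawl the first four image collections on http://699pic.com/image/1/ (having several collections makes the categorization easy to demonstrate).

This walkthrough uses option 1: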

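Step 1: Create the project and the spider

1. Create the project: scrapy startproject xxx (project name)

2. Create the spider: change into the directory created in the previous step, then run: scrapy genspider xxx (spider name) xxx (domain)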

Step 2: Create start.py

    from scrapy import cmdline

    # Equivalent to running "scrapy crawl 699pic" in a shell; 699pic is the spider name.
    cmdline.execute("scrapy crawl 699pic".split())

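Step 3: Configure settings

1. Turn off robots.txt compliance by setting ROBOTSTXT_OBEY to False

2. Set the request headers

3. Enable ITEM_PIPELINES

Comment out the pipeline that the project generated automatically. The entry for the custom pipeline (highlighted in yellow in the original screenshot) is written in a later step, so leave it out for now.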
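The three changes could look like the fragment below. This is a sketch, not the author's actual file: the package name img699pic is an assumption inferred from the item class name, and the User-Agent is the one used in the spider of this post.

```python
# settings.py (fragment); a sketch under the assumptions stated above

# 1. Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# 2. Default headers sent with every request.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    ),
}

# 3. Enable ITEM_PIPELINES. The generated pipeline stays commented out;
#    the custom pipeline is registered once it has been written.
ITEM_PIPELINES = {
    # "img699pic.pipelines.Img699PicPipeline": 300,  # assumed package name
}
```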

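Step 4: Define the item

    import scrapy


    class Img699PicItem(scrapy.Item):
        # Title of the category (image collection)
        category = scrapy.Field()
        # URLs of the images to download
        image_urls = scrapy.Field()
        # Filled in after a successful download with information about the images
        images = scrapy.Field()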

Step 5: Write the spider

    import scrapy
    import requests
    from lxml import etree

    from ..items import Img699PicItem


    class A699picSpider(scrapy.Spider):
        name = '699pic'
        allowed_domains = ['699pic.com']
        start_urls = ['http://699pic.com/image/1/']
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
        }

        def parse(self, response):
            # Only the first four collections on the listing page are processed.
            divs = response.xpath("//div[@class='special-list clearfix']/div")[0:4]
            for div in divs:
                category = div.xpath("./a[@class='special-list-title']//text()").get().strip()
                url = div.xpath("./a[@class='special-list-title']/@href").get().strip()
                image_urls = self.parse_url(url)
                item = Img699PicItem(category=category, image_urls=image_urls)
                yield item

        def parse_url(self, url):
            # Fetch the collection page with requests and extract the image URLs with lxml.
            response = requests.get(url=url, headers=self.headers)
            htmlElement = etree.HTML(response.text)
            image_urls = htmlElement.xpath("//div[@class='imgshow clearfix']//div[@class='list']/a/img/@src")
            return image_urls

Step 6: Write the pipelines

    import os

    from scrapy.pipelines.images import ImagesPipeline

    from . import settings


    class Img699PicPipeline(object):
        # Pipeline generated by the project template; it passes items through unchanged.
        def process_item(self, item, spider):
            return item


    class Images699Pipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Called before the download requests are sent; this method is in fact
            # what creates those requests. Attach the item to each request so that
            # file_path can read the category later.
            request_objs = super().get_media_requests(item, info)
            for request_obj in request_objs:
                request_obj.item = item
            return request_objs

        def file_path(self, request, response=None, info=None):
            # Called when an image is about to be stored; returns the path to store it at.
            path = super().file_path(request, response, info)
            category = request.item.get('category')
            image_store = settings.IMAGES_STORE
            category_path = os.path.join(image_store, category)
            # Create the category folder if it does not exist yet.
            os.makedirs(category_path, exist_ok=True)
            # Drop the default "full/" prefix and keep only the file name.
            image_name = path.replace("full/", "")
            image_path = os.path.join(category_path, image_name)
            return image_path
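The path logic in file_path can be isolated into a small pure function, which makes it easy to test without running a crawl. The sketch below is one possible variant, not the author's code: category_file_path is a hypothetical helper, and it returns a path relative to IMAGES_STORE, relying on my understanding that Scrapy's file system store joins the returned path onto IMAGES_STORE and creates missing folders itself. It also replaces path separators in the category title, an extra precaution the original does not take.

```python
import posixpath


def category_file_path(category, default_path):
    """Turn a default 'full/<hash>.jpg' path into '<category>/<hash>.jpg'."""
    file_name = posixpath.basename(default_path)
    # Replace separators so a category title cannot create nested folders,
    # and fall back to a fixed name when the title is empty.
    safe_category = category.strip().replace("/", "_").replace("\\", "_")
    safe_category = safe_category or "uncategorized"
    return posixpath.join(safe_category, file_name)


print(category_file_path("Cats", "full/0a1b2c.jpg"))
```

Inside the pipeline, file_path could then end with return category_file_path(category, path). In Scrapy 2.4 and later, file_path also receives the item as a keyword argument, so attaching the item to each request in get_media_requests may no longer be necessary there.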

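Step 7: Go back to settings

1. Fill in the ITEM_PIPELINES entry for the custom pipeline that was left out in step 3 (the part highlighted in yellow in the original screenshot)

2. Set the root folder where the images are stored

    import os

    IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')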
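The sketch below shows both parts of this step together and what the nested dirname calls resolve to. The settings_file path and the package name img699pic are hypothetical stand-ins; inside the real settings.py, __file__ takes the place of settings_file.

```python
import os

# Hypothetical location of settings.py; in the real file, __file__ plays this role.
settings_file = os.path.join("workspace", "img699pic", "img699pic", "settings.py")

# dirname once gives the package folder, twice gives the project root
# (the folder that also holds scrapy.cfg).
project_root = os.path.dirname(os.path.dirname(settings_file))
IMAGES_STORE = os.path.join(project_root, "images")

# Register the custom pipeline; "img699pic" is an assumed package name.
ITEM_PIPELINES = {
    "img699pic.pipelines.Images699Pipeline": 1,
}

print(IMAGES_STORE)
```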
