Batch Downloading 360 Beauty Images with the Scrapy Framework

  • October 5, 2019
  • Notes


0. Introduction
1. Project Initialization
2. Defining the Storage Structure
3. Core Spider Code
4. Downloading and Storing with Pipelines
5. JSON Basics

0. Introduction

The crawler posts are finally back. It has been a while since the last one, so let's pick it up again.

1. Project Initialization

  • Create the project
scrapy startproject images360  
  • Create the Spider (the generated project layout is sketched after this list)
scrapy genspider images images.so.com  
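After these two commands, the scaffold generated by Scrapy should look roughly as follows (this is the standard layout produced by Scrapy's default project template, shown here only for orientation):

images360/
    scrapy.cfg              # deployment configuration
    images360/
        __init__.py
        items.py            # item definitions (Step 2)
        middlewares.py
        pipelines.py        # item pipelines (Step 4)
        settings.py         # project settings
        spiders/
            __init__.py
            images.py       # the spider created by genspider (Step 3)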

2. Defining the Storage Structure

Take a look at the data:

items.py

from scrapy import Item, Field

# extract the data
class Images360Item(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'images'  # the MongoDB collection name is 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
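As a quick sanity check (a minimal sketch, not part of the project code), an Images360Item behaves like a dict but only accepts the declared fields:

from images360.items import Images360Item

item = Images360Item()
item['id'] = '123'                                # placeholder values
item['url'] = 'http://example.com/a.jpg'
item['title'] = 'demo'
item['thumb'] = 'http://example.com/a_thumb.jpg'
print(dict(item))
# assigning an undeclared key, e.g. item['other'] = 1, raises KeyError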

3. Core Spider Code

  • settings.py
MAX_PAGE = 50           # crawl 50 pages, 30 images per page, 1500 images in total
ROBOTSTXT_OBEY = False  # must be False, otherwise the site cannot be crawled
  • images.py

Analyze the page (http://images.so.com/z?ch=beauty)

Note: first open http://images.so.com/ and click the 美女 (Beauty) category. Then open the browser's developer tools, switch to the XHR filter under the Network tab, and scroll down the page; new requests are loaded dynamically and appear in the Name column on the left. Keep scrolling and you will notice that the sn parameter in those request names keeps changing while everything else stays the same.
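With the other parameters fixed and sn increased by 30 each time (sn = page * 30 in the spider below), the requests look roughly like this (illustrative only):

https://image.so.com/zj?ch=beauty&listtype=new&temp=1&sn=30
https://image.so.com/zj?ch=beauty&listtype=new&temp=1&sn=60
...
https://image.so.com/zj?ch=beauty&listtype=new&temp=1&sn=1500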

from scrapy import Spider, Request
from urllib.parse import urlencode
import json
from images360.items import Images360Item


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'beauty', 'listtype': 'new', 'temp': '1'}
        base_url = 'https://image.so.com/zj?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)  # convert the JSON string into a dict
        for image in result.get('list'):
            item = Images360Item()
            item['id'] = image.get('imageid')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('group_title')
            item['thumb'] = image.get('qhimg_thumb_url')
            yield item
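Judging from how parse() indexes the response, the JSON returned by the interface has roughly this shape (field names are taken from the code above; values are placeholders, and the real response may contain additional fields that the spider simply ignores):

{
    "list": [
        {
            "imageid": "...",
            "group_title": "...",
            "qhimg_url": "...",
            "qhimg_thumb_url": "..."
        }
    ]
}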

4. Downloading and Storing with Pipelines

  • Modify settings.py

Enable the Item Pipeline components. Each pipeline entry is assigned a number in the range 0-1000; this value determines the order in which the pipelines run, and the smaller the number, the earlier the pipeline runs.

ITEM_PIPELINES = {
    # download images to the local disk
    'images360.pipelines.ImagePipeline': 300,
    # store items in MongoDB
    'images360.pipelines.MongoPipeline': 301
}

BOT_NAME = 'images360'
MAX_PAGE = 50
MONGO_URI = 'localhost'
MONGO_DB = 'test'
  • Set the image storage path

settings.py

import os

# save downloaded images to the images directory under the current project directory
project_dir = os.path.abspath(os.path.dirname(__file__))
print(project_dir)
IMAGES_STORE = os.path.join(project_dir, 'images')
  • Modify pipelines.py

process_item(self, item, spider)

Each item pipeline component is an independent Python class that must implement the process_item(self, item, spider) method. This method is called on every item pipeline component; it must return a dict containing the data, return an Item object, or raise a DropItem exception. A dropped item will not be processed by any subsequent pipeline components.

import pymongo
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        print(image_paths)
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])

Explanation of the code above:

# Storing in MongoDB
open_spider(self, spider)
    called when the spider is opened
close_spider(self, spider)
    called when the spider is closed
from_crawler(cls, crawler)
    must be declared as a class method with @classmethod

# Downloading to the local disk
file_path(self, request, response=None, info=None)
    derives the original file name, e.g. xx.jpg, from the request URL (i.e. the image name)
get_media_requests(self, item, info)
    ImagesPipeline fetches the URLs listed in image_urls (here item['url'] is used instead);
    get_media_requests generates one Request for each URL
item_completed(self, results, item, info)
    once the images are downloaded, the results are passed to item_completed() as 2-tuples
    of the form (success, image_info_or_failure), where the first element indicates whether
    the download succeeded and the second element is a dict

image_paths = [x['path'] for ok, x in results if ok]  # collect the image paths, e.g. xx.jpg
# which is equivalent to:
image_paths = []
for ok, x in results:
    if ok:
        image_paths.append(x['path'])
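For a successfully downloaded image, the results list passed to item_completed() looks roughly like this (a sketch with placeholder values):

results = [
    (True, {
        'url': 'http://.../xx.jpg',   # the image URL that was requested
        'path': 'xx.jpg',             # path relative to IMAGES_STORE, as returned by file_path()
        'checksum': '...'             # checksum of the downloaded file
    })
]
# a failed download yields (False, failure) instead, which is why the
# list comprehension filters on ok

With the pipelines enabled, the crawl is started from the project directory with scrapy crawl images; images land in IMAGES_STORE and the items end up in the MongoDB test database.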

5. JSON Basics

a = {
    'asd': '12',
    'as': '4',
    'asd12': 's12',
    'list': [{'a': '12'}, {'b': '123'}]
}
import json
print(type(a))
a = json.dumps(a)
print(a)
print(type(a))
a = json.loads(a)
print(a)
print(type(a))
a = a.get('list')
print(a)
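For reference, running the snippet above should print output along these lines (json.dumps turns the dict into a str with double-quoted keys, and json.loads turns it back into a dict):

<class 'dict'>
{"asd": "12", "as": "4", "asd12": "s12", "list": [{"a": "12"}, {"b": "123"}]}
<class 'str'>
{'asd': '12', 'as': '4', 'asd12': 's12', 'list': [{'a': '12'}, {'b': '123'}]}
<class 'dict'>
[{'a': '12'}, {'b': '123'}]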