Scrapy Framework: Batch Downloading 360 Beauty Images
- October 5, 2019
- Notes

Contents:
- 0. Introduction
- 1. Project Initialization
- 2. Define the Storage Structure
- 3. Spider Core Code
- 4. Pipeline Download and Storage
- 5. JSON Basics
0. Introduction
The crawler posts are finally back. It has been a while since I last wrote about crawlers, so let's pick them up again.
1. Project Initialization
- Create the project
scrapy startproject images360
- Create the Spider
scrapy genspider images images.so.com
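Running these two commands produces the standard Scrapy project layout; the images.py file under spiders/ is the one generated by genspider:

```
images360/
├── scrapy.cfg
└── images360/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── images.py
```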
2. Define the Storage Structure
Inspect the data:
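The interface returns JSON. Here is a trimmed, illustrative sample of the response shape (only the fields the spider extracts later are shown, and the values are made up):

```json
{
    "list": [
        {
            "imageid": "a1b2c3d4",
            "group_title": "example title",
            "qhimg_url": "http://p0.qhimg.com/xxxx.jpg",
            "qhimg_thumb_url": "http://p0.qhimg.com/xxxx_thumb.jpg"
        }
    ]
}
```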

items.py
```python
from scrapy import Item, Field


# Item holding the extracted data
class Images360Item(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'images'  # MongoDB collection name is set to 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
```
3. Spider Core Code
- settings.py
```python
MAX_PAGE = 50           # crawl 50 pages, 30 images per page, 1500 images in total
ROBOTSTXT_OBEY = False  # must be set to False, otherwise the site cannot be crawled
```
- images.py
Analyze the page (http://images.so.com/z?ch=beauty).
Note: first open http://images.so.com/ and click the Beauty category. Then open the browser's developer tools, switch to the Network tab, filter by XHR, and scroll down the page. Requests are dynamically loaded into the Name column on the left (as shown in Figure 1 below); as you keep scrolling, you will notice that the sn parameter in those requests keeps changing while the other parameters stay the same.


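Before writing the spider, you can sanity-check the interface with a short standalone script (just a sketch, not part of the project; it uses the third-party requests package, and the parameters mirror the ones built in start_requests() below):

```python
import requests

# Fetch one page of the beauty channel; sn is the offset that grows as you scroll
params = {'ch': 'beauty', 'listtype': 'new', 'temp': '1', 'sn': 30}
resp = requests.get('https://image.so.com/zj', params=params)
data = resp.json()
for image in data.get('list', []):
    print(image.get('imageid'), image.get('qhimg_url'))
```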
```python
from scrapy import Spider, Request
from urllib.parse import urlencode
import json

from images360.items import Images360Item


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'beauty', 'listtype': 'new', 'temp': '1'}
        base_url = 'https://image.so.com/zj?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)  # parse the JSON string into a dict
        for image in result.get('list'):
            item = Images360Item()
            item['id'] = image.get('imageid')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('group_title')
            item['thumb'] = image.get('qhimg_thumb_url')
            yield item
```
4. Pipeline Download and Storage
- Modify settings.py
Enable the Item Pipeline components. Each pipeline is assigned a number in the range 0-1000; this number determines the order in which the pipelines run, and the smaller the number, the higher the priority.
```python
ITEM_PIPELINES = {
    # download images to the local disk
    'images360.pipelines.ImagePipeline': 300,
    # store items into MongoDB
    'images360.pipelines.MongoPipeline': 301,
}
BOT_NAME = 'images360'
MAX_PAGE = 50
MONGO_URI = 'localhost'
MONGO_DB = 'test'
```
- Set the image storage path
settings.py
```python
import os

# Save downloaded images into an 'images' directory under the current project directory
project_dir = os.path.abspath(os.path.dirname(__file__))
print(project_dir)
IMAGES_STORE = os.path.join(project_dir, 'images')
```
- Modify pipelines.py
process_item(self, item, spider)
Each item pipeline component is an independent Python class that must implement the process_item(self, item, spider) method. Scrapy calls this method on every pipeline component for each item; the method must return a dict containing the data or an Item object, or raise a DropItem exception. A dropped item will not be processed by any later pipeline components.
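A minimal, hypothetical pipeline following this contract (the title field is only an example) might look like this:

```python
from scrapy.exceptions import DropItem


class ExamplePipeline(object):
    def process_item(self, item, spider):
        # Drop items that lack a title; pass everything else on unchanged
        if not item.get('title'):
            raise DropItem('Missing title')
        return item
```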
```python
import pymongo
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        print(image_paths)
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])
```
Explanation of the code above:

Storing to MongoDB:
- open_spider(self, spider): called when the spider is opened
- close_spider(self, spider): called when the spider is closed
- from_crawler(cls, crawler): must be declared as a class method with @classmethod

Downloading to the local disk:
- file_path(self, request, response=None, info=None): derives the original file name (e.g. xx.jpg) from the request URL
- get_media_requests(self, item, info): the ImagePipeline crawls the specified image URLs; get_media_requests generates one Request per URL (here, item['url'])
- item_completed(self, results, item, info): once the images have been downloaded, the results are passed to item_completed() as a list of two-element tuples of the form (success, image_info_or_failure), where the first element indicates whether the download succeeded and the second is a dict with the download details.

The line

image_paths = [x['path'] for ok, x in results if ok]  # collect the paths (e.g. xx.jpg) of the successfully downloaded images

is equivalent to:

image_paths = []
for ok, x in results:
    if ok:
        image_paths.append(x['path'])
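For reference, the results argument received by item_completed() has roughly the following shape (a sketch with made-up values; the 'path' entry is whatever file_path() returned, relative to IMAGES_STORE):

```python
results = [
    (True, {
        'url': 'http://p0.qhimg.com/xxxx.jpg',      # the image URL that was requested
        'path': 'xxxx.jpg',                         # file name returned by file_path()
        'checksum': '2f5078116b1d2e725b21f3f4f0',   # made-up checksum
    }),
    # a failed download would appear as (False, <failure information>)
]
```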
5. JSON Basics
```python
import json

a = {
    'asd': '12',
    'as': '4',
    'asd12': 's12',
    'list': [{'a': '12'}, {'b': '123'}]
}

print(type(a))     # <class 'dict'>
a = json.dumps(a)  # dict -> JSON string
print(a)
print(type(a))     # <class 'str'>
a = json.loads(a)  # JSON string -> dict
print(a)
print(type(a))     # <class 'dict'>
a = a.get('list')
print(a)           # [{'a': '12'}, {'b': '123'}]
```