Batch-Downloading 360 Beauty Images with the Scrapy Framework

  • October 5, 2019
  • Notes


0. Introduction
1. Project Initialization
2. Defining the Storage Structure
3. Spider Core Code
4. Pipeline: Download and Storage
5. JSON Basics

0. Introduction

The crawler posts are finally back. It has been a long time since the last one, so let's get going again.

1. Project Initialization

  • Create the project:
scrapy startproject images360
  • Create the Spider:
scrapy genspider images images.so.com
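After these two commands, the generated project layout should look roughly like this (the standard Scrapy scaffold; exact files may vary slightly across Scrapy versions):

images360/
    scrapy.cfg
    images360/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            images.py

images.py is the Spider we just generated; items.py and pipelines.py are filled in below.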

2. Defining the Storage Structure

Inspect the data to decide which fields to keep:

items.py

# Item definition for the extracted data
from scrapy import Item, Field


class Images360Item(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'images'  # MongoDB collection (table) name: images
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
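As a quick illustration (not part of the project files), a Scrapy Item behaves like a dict, which is exactly what the MongoDB pipeline in section 4 relies on; the values below are made up:

from images360.items import Images360Item

item = Images360Item()
item['id'] = 't01abc'                                  # made-up values, for illustration only
item['url'] = 'http://p0.qhimg.com/t01abc.jpg'
item['title'] = 'example'
item['thumb'] = 'http://p0.qhimg.com/t01abc_thumb.jpg'

print(item.collection)  # 'images' -> later used as the MongoDB collection name
print(dict(item))       # plain dict, ready to be inserted into MongoDB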

3. Spider Core Code

  • settings.py
MAX_PAGE = 50          # crawl 50 pages, 30 images per page, 1500 images in total
ROBOTSTXT_OBEY = False # set to False, otherwise the site cannot be crawled
  • images.py

Analyze the web page (http://images.so.com/z?ch=beauty).

First open http://images.so.com/ and click the beauty category. Then open the browser's developer tools, switch to the Network panel and filter by XHR, and scroll down the page. The dynamically loaded requests appear in the Name column on the left, as shown in Figure 1. Keep scrolling and you will notice that the sn parameter in those requests keeps changing while all the other parameters stay the same.
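Before writing the Spider, it can help to sanity-check the AJAX endpoint outside Scrapy. Here is a minimal sketch using the third-party requests library; the endpoint and field names follow the spider code below, and the response schema may change over time:

import json
from urllib.parse import urlencode

import requests  # third-party library, assumed installed

params = urlencode({'ch': 'beauty', 'listtype': 'new', 'temp': '1', 'sn': 30})
resp = requests.get('https://image.so.com/zj?' + params)
data = json.loads(resp.text)
for image in data.get('list', []):
    print(image.get('imageid'), image.get('qhimg_url'))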

from scrapy import Spider, Request
from urllib.parse import urlencode
import json

from images360.items import Images360Item


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'beauty', 'listtype': 'new', 'temp': '1'}
        base_url = 'https://image.so.com/zj?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)  # convert the JSON string into a dict
        for image in result.get('list'):
            item = Images360Item()
            item['id'] = image.get('imageid')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('group_title')
            item['thumb'] = image.get('qhimg_thumb_url')
            yield item
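Note that sn appears to act as an offset: each response carries 30 images, so setting data['sn'] = page * 30 walks through the results page by page, and MAX_PAGE = 50 therefore yields roughly 1500 images in total.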

4. Pipeline: Download and Storage

  • Modify settings.py

Enable the item pipeline components. Each pipeline is assigned an integer value in the range 0-1000 that determines the order in which the pipelines run: the lower the number, the higher the priority.

ITEM_PIPELINES = {
    # download the images to local disk
    'images360.pipelines.ImagePipeline': 300,
    # store the items in MongoDB
    'images360.pipelines.MongoPipeline': 301,
}

BOT_NAME = 'images360'
MAX_PAGE = 50
MONGO_URI = 'localhost'
MONGO_DB = 'test'
  • Set the image storage path

settings.py

import os

# Save downloaded images to the images directory under the current project directory
project_dir = os.path.abspath(os.path.dirname(__file__))
print(project_dir)
IMAGES_STORE = os.path.join(project_dir, 'images')
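Note: Scrapy's ImagesPipeline relies on the Pillow library for image handling, so make sure it is installed (for example with pip install pillow), otherwise the image downloads will not work.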
  • Modify pipelines.py

process_item(self, item, spider)

Each item pipeline component is an independent Python class that must implement the process_item(self, item, spider) method. Scrapy calls this method for every item that passes through the pipeline; it must return a dict with data or an Item object, or raise a DropItem exception. A dropped item is not processed by any later pipeline components.
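As a standalone sketch of that contract (a hypothetical pipeline, separate from this project's pipelines.py shown below), dropping items that lack a url field could look like this:

from scrapy.exceptions import DropItem


class RequireUrlPipeline(object):
    # Hypothetical example, not part of the images360 project.
    def process_item(self, item, spider):
        if not item.get('url'):
            raise DropItem('missing url')  # dropped items skip all later pipelines
        return item                        # returned items continue down the chain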

import pymongo
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        print(image_paths)
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])

Explanation of the code above:

# MongoDB storage (MongoPipeline)
open_spider(self, spider): called when the spider is opened.
close_spider(self, spider): called when the spider is closed.
from_crawler(cls, crawler): must be declared as a class method with @classmethod.

# Local download (ImagePipeline)
file_path(self, request, response=None, info=None): derives the image file name (e.g. xx.jpg) from the request URL.
get_media_requests(self, item, info): the pipeline downloads the URLs specified for the item (by default those in image_urls); get_media_requests yields a Request for each URL, here item['url'].
item_completed(self, results, item, info): called after the images have been downloaded. The download results are passed in as a list of two-element tuples of the form (success, image_info_or_failure), where the first element indicates whether the download succeeded and the second is a dict.

image_paths = [x['path'] for ok, x in results if ok]  # collect the downloaded file paths, e.g. xx.jpg

is equivalent to:

image_paths = []
for ok, x in results:
    if ok:
        image_paths.append(x['path'])
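With everything in place, start the crawl from the project root with scrapy crawl images; the pictures end up in the images/ directory configured by IMAGES_STORE, and the item data in the test database in MongoDB.

One caveat about the MongoDB write: Collection.insert() is deprecated in pymongo 3.x and was removed in 4.x, so with a recent pymongo you may need self.db[item.collection].insert_one(dict(item)) instead.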

5. JSON Basics

import json

a = {
    'asd': '12',
    'as': '4',
    'asd12': 's12',
    'list': [{'a': '12'}, {'b': '123'}]
}
print(type(a))     # dict

a = json.dumps(a)  # dict -> JSON string
print(a)
print(type(a))     # str

a = json.loads(a)  # JSON string -> dict
print(a)
print(type(a))     # dict

a = a.get('list')
print(a)
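Running this snippet should print roughly the following (dict key order is preserved from insertion in Python 3.7+):

<class 'dict'>
{"asd": "12", "as": "4", "asd12": "s12", "list": [{"a": "12"}, {"b": "123"}]}
<class 'str'>
{'asd': '12', 'as': '4', 'asd12': 's12', 'list': [{'a': '12'}, {'b': '123'}]}
<class 'dict'>
[{'a': '12'}, {'b': '123'}]

This is the same dumps/loads round trip the spider relies on when it calls json.loads(response.text) in parse().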