Scraping City Weather Forecasts with the Scrapy Framework

  • October 5, 2019
  • Notes



Today's Tip

vi navigation

vi l.py +5    # open l.py and jump straight to line 5 of the offending code
vi l.py +     # open l.py and jump straight to the last line

1. Project Initialization
2. Extracting Data
  2.1 How It Works
  2.2 Data Extraction
  2.3 Custom Spider
3. Storing Data
  3.1 Modify settings.py
  3.2 Data Storage
4. Results
5. Author's Note

1. Project Initialization

  • Create the project
scrapy startproject weather  
  • Create the spider
scrapy genspider CqtianqiSpider tianqi.com
'''
Because the name CqtianqiSpider would later be used as scrapy crawl CqtianqiSpider
and is rather long, the spider's name attribute is changed to CQtianqi,
so the crawl command becomes: scrapy crawl CQtianqi
'''

2. Extracting Data

2.1 How It Works

The goal this time is to extract the 7-day weather forecast for Chongqing and Yanhu District. The relevant page source is shown in the screenshot above, which highlights exactly the elements this crawler needs to locate.
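Before writing the spider, the XPath selectors can be sanity-checked interactively in scrapy shell. A minimal session sketch (using the Chongqing URL and the same selectors the spider below relies on):

scrapy shell http://www.tianqi.com/chongqing/
>>> # dates, weekdays and weather icons for the 7-day block
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract()
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()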

Next, define the fields to be stored:

date = date of the day
week = day of the week
img = weather icon for the day
wind = wind conditions for the day
weather = weather for the day
high_temperature = daily high temperature
low_temperature = daily low temperature

2.2 Data Extraction

Modify items.py

import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'weather'  # MongoDB collection name, read later by MongoPipeline
    date = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    high_temperature = scrapy.Field()
    low_temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()

2.3 Custom Spider

CQtianqi.py

# -*- coding: utf-8 -*-
import scrapy

from weather.items import WeatherItem


class CqtianqiSpider(scrapy.Spider):
    name = 'CQtianqi'
    allowed_domains = ['tianqi.com']
    start_urls = []
    citys = ['chongqing', 'yanhuqu']
    for city in citys:
        start_urls.append('http://' + 'www.tianqi.com/' + city + '/')

    def parse(self, response):
        '''
        date = date of the day
        week = day of the week
        img = weather icon for the day
        wind = wind conditions for the day
        weather = weather for the day
        high_temperature = daily high temperature
        low_temperature = daily low temperature
        :param response:
        :return:
        '''
        # oneweek = response.xpath('//div[@class="day7"]')
        item = WeatherItem()
        date = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()
        week = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract()
        base_url = 'http:'
        img = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()
        imgs = []
        for i in range(7):
            img_i = img[i]
            img_url = base_url + img_i
            imgs.append(img_url)

        print(date)
        print(week)
        print(imgs)
        weather = response.xpath('//div[@class="day7"]//ul[@class="txt txt2"]//li/text()').extract()
        print(weather)
        high_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/span/text()').extract()
        low_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/b/text()').extract()
        print(high_temperature)
        print(low_temperature)

        wind = response.xpath('//div[@class="day7"]//ul[@class="txt"][1]//li/text()').extract()
        print(wind)

        item['date'] = date
        item['week'] = week
        item['img'] = imgs
        item['weather'] = weather
        item['wind'] = wind
        item['high_temperature'] = high_temperature
        item['low_temperature'] = low_temperature
        yield item
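Building start_urls with a for loop in the class body works, but the more conventional Scrapy idiom is to override start_requests(). A minimal sketch of the equivalent method inside CqtianqiSpider:

    # Sketch: same URLs as the class-body loop above, generated lazily
    def start_requests(self):
        for city in ['chongqing', 'yanhuqu']:
            url = 'http://www.tianqi.com/' + city + '/'
            yield scrapy.Request(url, callback=self.parse)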

3. Storing Data

3.1 Modify settings.py

# Add these two lines
MONGO_URI = 'localhost'
MONGO_DB = 'test'

# Modify the following directly (the numbers set pipeline order: lower runs first)
ITEM_PIPELINES = {
   'weather.pipelines.WeatherPipeline': 300,
   'weather.pipelines.W2json': 301,
   'weather.pipelines.MongoPipeline': 302,
   'weather.pipelines.W2mysql': 303,
}
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36'

3.2 Data Storage

Modify pipelines.py

Storing to MongoDB

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
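Note that collection.insert() is deprecated in pymongo 3.x and removed in 4.x; on a newer pymongo the equivalent call inside process_item would be:

        # pymongo 3.x/4.x replacement for the deprecated insert()
        self.db[item.collection].insert_one(dict(item))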

Storing to MySQL

import pymysql


class W2mysql(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data to MySQL
        '''
        connection = pymysql.connect(host='localhost', user='root', password='xxx', db='scrapydb',
                                     charset='utf8mb4')
        try:
            with connection.cursor() as cursor:
                for i in range(7):
                    sql = "insert into `weather`(`date`,`week`,`high_temperature`,`low_temperature`,`weather`,`wind`,`img`)values(%s,%s,%s,%s,%s,%s,%s)"
                    cursor.execute(sql, (
                        item['date'][i], item['week'][i], item['high_temperature'][i], item['low_temperature'][i],
                        item['weather'][i],
                        item['wind'][i], item['img'][i]))

            connection.commit()
        # except pymysql.err.IntegrityError as e:
        #     print('Duplicate record, do not insert it again!')
        finally:
            connection.close()
        return item
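The pipeline assumes a weather table already exists in the scrapydb database. The post does not show the schema, so the column types below are assumptions; a one-off creation sketch might look like this:

import pymysql

# One-off helper to create the `weather` table (column types are assumptions,
# since the original post does not show the schema).
connection = pymysql.connect(host='localhost', user='root', password='xxx',
                             db='scrapydb', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("""
            create table if not exists `weather`(
                `date` varchar(32),
                `week` varchar(16),
                `high_temperature` varchar(16),
                `low_temperature` varchar(16),
                `weather` varchar(64),
                `wind` varchar(64),
                `img` varchar(255)
            ) default charset=utf8mb4
        """)
    connection.commit()
finally:
    connection.close()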

Storing to txt
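The txt and JSON pipelines below reference a module-level pathdir variable that the post does not show; a plausible definition (an assumption, pointing at the project root so that the data folder sits next to scrapy.cfg) would be:

import os

# Assumed definition of pathdir: the project root directory, one level above
# the package that contains pipelines.py. Not shown in the original post.
pathdir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))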

class WeatherPipeline(object):
    def process_item(self, item, spider):
        # The data is saved to weather.txt under the data directory
        filename = pathdir + '\data\weather.txt'
        # Open the file in append mode and write the corresponding fields
        with open(filename, 'a', encoding='utf8') as f:
            for i in range(7):
                f.write('Date: ' + item['date'][i] + '\n')
                f.write('Week: ' + item['week'][i] + '\n')
                f.write('High temperature: ' + item['high_temperature'][i] + '\n')
                f.write('Low temperature: ' + item['low_temperature'][i] + '\n')
                f.write('Weather: ' + item['weather'][i] + '\n')
                f.write('Wind: ' + item['wind'][i] + '\n')
                f.write('-------------------------------------' + '\n')

        return item

Storing to JSON

import json


class W2json(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data to JSON
        so it can be reused easily
        '''
        filename = pathdir + '\data\weather.json'

        # Open the JSON file and write the data with json.dumps.
        # Note the ensure_ascii=False argument; without it the data is stored
        # as escaped unicode such as "\xe15".
        with open(filename, 'a', encoding='utf8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)

        return item
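Since the pipeline writes one JSON object per line, reading the file back is straightforward. A quick sketch (assuming it is run from the project root so that data/weather.json resolves):

import json

# Load the line-delimited JSON written by W2json back into a list of dicts.
with open('data/weather.json', encoding='utf8') as f:
    records = [json.loads(line) for line in f if line.strip()]
print(len(records), 'records loaded')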

Run

Change into the weather project root directory (not the inner weather package directory underneath it!), then run:

scrapy crawl CQtianqi

4. Results

Data stored to txt

Only part of the data is shown in the screenshot; in fact each record appears twice.

Data stored to JSON

This is not duplication; what is stored is the data for the two regions!

Data stored to MongoDB

This is not duplication; what is stored is the data for the two regions!

Data stored to MySQL

This is not duplication; what is stored is the data for the two regions!

Running in the terminal