Scrapy Framework: Crawling City Weather Forecasts

  • October 5, 2019
  • Notes



Tip of the Day

Jumping to a line in vi

vi l.py +5    # open the file with the cursor on line 5 (e.g. the line with the error)
vi l.py +     # open the file with the cursor on the last line
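A related form (standard vim syntax, shown here with a made-up search term) jumps to the first line matching a pattern:

vi l.py +/parse    # open the file at the first line containing "parse"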

1. Project initialization
2. Extracting the data
   2.1 Analysis
   2.2 Data extraction
   2.3 Custom spider
3. Storing the data
   3.1 Modify settings.py
   3.2 Data storage
4. Results
5. Author's note

1. Project Initialization

  • Create the project (the resulting layout is sketched after this list)
scrapy startproject weather  
  • Create the spider
scrapy genspider CqtianqiSpider tianqi.com
'''
Because the name CqtianqiSpider would later have to be typed out in
`scrapy crawl CqtianqiSpider` and is rather long, the spider's name is
changed to CQtianqi, so the crawl command becomes: scrapy crawl CQtianqi
'''
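For reference, the two commands above generate roughly the standard Scrapy scaffolding sketched below; the data directory used by the txt/json pipelines later is not generated and has to be created by hand, and the spider file appears as CQtianqi.py in this post:

weather/
├── scrapy.cfg
└── weather/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── data/                # created manually for weather.txt / weather.json
    └── spiders/
        ├── __init__.py
        └── CQtianqi.py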

2. Extracting the Data

2.1 Analysis

The goal this time is to extract the 7-day weather forecast for Chongqing and Yanhu District. The relevant page source is shown in the figure above; the highlighted portion is exactly what this crawler needs to locate.

Next, define the data to be stored:

date             = date of the day
week             = day of the week
img              = weather icon for the day
wind             = wind conditions for the day
weather          = weather for the day
high_temperature = daily high temperature
low_temperature  = daily low temperature

2.2 Data Extraction

Modify items.py

import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'weather'

    date = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    high_temperature = scrapy.Field()
    low_temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
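A WeatherItem behaves like a dict, while `collection` is a plain class attribute rather than a Field; the Mongo pipeline below reads it as `item.collection` to pick the collection name. A minimal sketch of that behaviour (the field values here are made up):

from weather.items import WeatherItem

item = WeatherItem()
item['date'] = ['05', '06']      # fields are assigned like dict keys
print(item.collection)           # 'weather' -- later used by MongoPipeline
print(dict(item))                # {'date': ['05', '06']}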

2.3 Custom Spider

CQtianqi.py

# -*- coding: utf-8 -*-
import scrapy

from weather.items import WeatherItem


class CqtianqiSpider(scrapy.Spider):
    name = 'CQtianqi'
    allowed_domains = ['tianqi.com']

    # build one start URL per city
    start_urls = []
    citys = ['chongqing', 'yanhuqu']
    for city in citys:
        start_urls.append('http://' + 'www.tianqi.com/' + city + '/')

    def parse(self, response):
        '''
        date = date of the day
        week = day of the week
        img = weather icon for the day
        wind = wind conditions for the day
        weather = weather for the day
        high_temperature = daily high temperature
        low_temperature = daily low temperature
        :param response:
        :return:
        '''
        # oneweek = response.xpath('//div[@class="day7"]')
        item = WeatherItem()

        date = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()
        week = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract()

        # the icon src attributes are protocol-relative, so prepend 'http:'
        base_url = 'http:'
        img = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()
        imgs = []
        for i in range(7):
            img_i = img[i]
            img_url = base_url + img_i
            imgs.append(img_url)

        print(date)
        print(week)
        print(imgs)

        weather = response.xpath('//div[@class="day7"]//ul[@class="txt txt2"]//li/text()').extract()
        print(weather)

        high_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/span/text()').extract()
        low_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/b/text()').extract()
        print(high_temperature)
        print(low_temperature)

        wind = response.xpath('//div[@class="day7"]//ul[@class="txt"][1]//li/text()').extract()
        print(wind)

        item['date'] = date
        item['week'] = week
        item['img'] = imgs
        item['weather'] = weather
        item['wind'] = wind
        item['high_temperature'] = high_temperature
        item['low_temperature'] = low_temperature
        yield item
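Before running the full spider, the XPath expressions above can be sanity-checked interactively with `scrapy shell`; a quick session might look like this (output omitted, and the page structure may have changed since the post was written):

scrapy shell 'http://www.tianqi.com/chongqing/'
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()     # the 7 dates
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()     # icon URLs (protocol-relative)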

3. Storing the Data

3.1 Modify settings.py

# Add these two lines
MONGO_URI = 'localhost'
MONGO_DB = 'test'

# Modify the following directly
# (the integer is the pipeline priority: lower numbers run first,
#  so items pass through the pipelines in the order listed)
ITEM_PIPELINES = {
   'weather.pipelines.WeatherPipeline': 300,
   'weather.pipelines.W2json': 301,
   'weather.pipelines.MongoPipeline': 302,
   'weather.pipelines.W2mysql': 303,
}
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36'
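Because MONGO_URI and MONGO_DB are ordinary Scrapy settings, they can also be overridden for a single run with the -s flag, which is handy for pointing a test crawl at another database (the database name below is just an example):

scrapy crawl CQtianqi -s MONGO_DB=weather_test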

3.2 Data Storage

Modify pipelines.py

Storing to MongoDB

import pymongo    # needed at the top of pipelines.py


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings added to settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection is the class attribute defined in WeatherItem
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
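To confirm the documents actually landed in MongoDB, a quick check with pymongo (the database and collection names follow the settings and WeatherItem.collection above) might look like this:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['test']
for doc in db['weather'].find():
    print(doc['date'], doc['high_temperature'])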

Storing to MySQL

import pymysql    # needed at the top of pipelines.py


class W2mysql(object):
    def process_item(self, item, spider):
        '''
        Save the crawled data to MySQL.
        '''
        connection = pymysql.connect(host='localhost', user='root', password='xxx', db='scrapydb',
                                     charset='utf8mb4')
        try:
            with connection.cursor() as cursor:
                # one row per forecast day
                for i in range(7):
                    sql = "insert into `weather`(`date`,`week`,`high_temperature`,`low_temperature`,`weather`,`wind`,`img`)values(%s,%s,%s,%s,%s,%s,%s)"
                    cursor.execute(sql, (
                        item['date'][i], item['week'][i], item['high_temperature'][i], item['low_temperature'][i],
                        item['weather'][i],
                        item['wind'][i], item['img'][i]))

            connection.commit()
        # except pymysql.err.IntegrityError as e:
        #     print('duplicate row, do not insert it again!')
        finally:
            connection.close()
        return item
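The insert statement assumes a `weather` table already exists in the `scrapydb` database. The post does not show the schema, but a minimal table matching the seven columns used above could be created with a sketch like this (all columns as VARCHAR for simplicity; this is an assumption, not the author's original DDL):

import pymysql

connection = pymysql.connect(host='localhost', user='root', password='xxx',
                             db='scrapydb', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        # column names match the INSERT used by the W2mysql pipeline
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS `weather` (
                `date` VARCHAR(32),
                `week` VARCHAR(32),
                `high_temperature` VARCHAR(16),
                `low_temperature` VARCHAR(16),
                `weather` VARCHAR(64),
                `wind` VARCHAR(64),
                `img` VARCHAR(255)
            ) DEFAULT CHARSET=utf8mb4
        """)
    connection.commit()
finally:
    connection.close()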

Storing to txt

import os    # needed at the top of pipelines.py

# pathdir is not shown in the post; here it is assumed to be the directory
# containing pipelines.py, with a data/ folder created inside it
pathdir = os.path.dirname(os.path.abspath(__file__))


class WeatherPipeline(object):
    def process_item(self, item, spider):
        # the data is written to weather.txt inside the data directory
        filename = os.path.join(pathdir, 'data', 'weather.txt')
        # open the file in append mode and write one block per forecast day
        with open(filename, 'a', encoding='utf8') as f:
            for i in range(7):
                f.write('日期:' + item['date'][i] + '\n')
                f.write('星期:' + item['week'][i] + '\n')
                f.write('最高溫度:' + item['high_temperature'][i] + '\n')
                f.write('最低溫度:' + item['low_temperature'][i] + '\n')
                f.write('天氣:' + item['weather'][i] + '\n')
                f.write('風況:' + item['wind'][i] + '\n')
                f.write('-------------------------------------' + '\n')

        return item

Storing to JSON

import json    # needed at the top of pipelines.py


class W2json(object):
    def process_item(self, item, spider):
        '''
        Save the crawled data to JSON so it is easy to reuse later.
        '''
        filename = os.path.join(pathdir, 'data', 'weather.json')

        # append one json.dumps line per item; ensure_ascii=False is required,
        # otherwise the Chinese text would be stored as escape sequences
        # (e.g. \uXXXX) instead of readable characters
        with open(filename, 'a', encoding='utf8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)

        return item
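Since each item is dumped as one JSON object per line, the file can be read back line by line; a small sketch, assuming the same data/weather.json path:

import json

with open('data/weather.json', encoding='utf8') as f:
    for line in f:
        record = json.loads(line)
        print(record['date'], record['weather'])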

Running

cd into the weather project root, not the inner weather directory underneath it! Then run:

scrapy crawl CQtianqi
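As a side note, Scrapy's built-in feed exports can also dump the items directly without any custom pipeline, e.g.:

scrapy crawl CQtianqi -o weather_feed.json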

4. Results

Data stored to txt

Only part of the data is captured in the screenshot; in fact each record appears twice.

Data stored to JSON

This is not duplication; what is stored is the data for the two regions!

Data stored to MongoDB

This is not duplication; what is stored is the data for the two regions!

Data stored to MySQL

This is not duplication; what is stored is the data for the two regions!

Running in the terminal