Scrapy Framework: Crawling City Weather Forecasts

  • October 5, 2019
  • Notes



Tip of the Day

Jumping to a line in vi

vi l.py +5    # open the file with the cursor on line 5 (e.g. the line with the error)
vi l.py +     # open the file with the cursor on the last line
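A related form (standard vim syntax, shown here with a made-up search term) jumps to the first line matching a pattern:

vi l.py +/parse    # open the file at the first line containing "parse"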

1. Project initialization
2. Extracting the data
   2.1 Analysis
   2.2 Data extraction
   2.3 Custom spider
3. Storing the data
   3.1 Modify settings.py
   3.2 Data storage
4. Results
5. Author's note

1. Project Initialization

  • Create the project (the resulting layout is sketched after this list)
scrapy startproject weather  
  • Create the spider
scrapy genspider CqtianqiSpider tianqi.com
'''
Because the name CqtianqiSpider would later have to be typed out in
`scrapy crawl CqtianqiSpider` and is rather long, the spider's name is
changed to CQtianqi, so the crawl command becomes: scrapy crawl CQtianqi
'''
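For reference, the two commands above generate roughly the standard Scrapy scaffolding sketched below; the data directory used by the txt/json pipelines later is not generated and has to be created by hand, and the spider file appears as CQtianqi.py in this post:

weather/
├── scrapy.cfg
└── weather/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── data/                # created manually for weather.txt / weather.json
    └── spiders/
        ├── __init__.py
        └── CQtianqi.py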

2. Extracting the Data

2.1 Analysis

The goal this time is to extract the 7-day weather forecast for Chongqing and Yanhu District. The relevant page source is shown in the figure above; the highlighted portion is exactly what this crawler needs to locate.

Next, define the data to be stored:

date             = date of the day
week             = day of the week
img              = weather icon for the day
wind             = wind conditions for the day
weather          = weather for the day
high_temperature = daily high temperature
low_temperature  = daily low temperature

2.2 Data Extraction

Modify items.py

import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'weather'

    date = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    high_temperature = scrapy.Field()
    low_temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
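A WeatherItem behaves like a dict, while `collection` is a plain class attribute rather than a Field; the Mongo pipeline below reads it as `item.collection` to pick the collection name. A minimal sketch of that behaviour (the field values here are made up):

from weather.items import WeatherItem

item = WeatherItem()
item['date'] = ['05', '06']      # fields are assigned like dict keys
print(item.collection)           # 'weather' -- later used by MongoPipeline
print(dict(item))                # {'date': ['05', '06']}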

2.3 Custom Spider

CQtianqi.py

# -*- coding: utf-8 -*-
import scrapy

from weather.items import WeatherItem


class CqtianqiSpider(scrapy.Spider):
    name = 'CQtianqi'
    allowed_domains = ['tianqi.com']

    # build one start URL per city
    start_urls = []
    citys = ['chongqing', 'yanhuqu']
    for city in citys:
        start_urls.append('http://' + 'www.tianqi.com/' + city + '/')

    def parse(self, response):
        '''
        date = date of the day
        week = day of the week
        img = weather icon for the day
        wind = wind conditions for the day
        weather = weather for the day
        high_temperature = daily high temperature
        low_temperature = daily low temperature
        :param response:
        :return:
        '''
        # oneweek = response.xpath('//div[@class="day7"]')
        item = WeatherItem()

        date = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()
        week = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract()

        # the icon src attributes are protocol-relative, so prepend 'http:'
        base_url = 'http:'
        img = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()
        imgs = []
        for i in range(7):
            img_i = img[i]
            img_url = base_url + img_i
            imgs.append(img_url)

        print(date)
        print(week)
        print(imgs)

        weather = response.xpath('//div[@class="day7"]//ul[@class="txt txt2"]//li/text()').extract()
        print(weather)

        high_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/span/text()').extract()
        low_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/b/text()').extract()
        print(high_temperature)
        print(low_temperature)

        wind = response.xpath('//div[@class="day7"]//ul[@class="txt"][1]//li/text()').extract()
        print(wind)

        item['date'] = date
        item['week'] = week
        item['img'] = imgs
        item['weather'] = weather
        item['wind'] = wind
        item['high_temperature'] = high_temperature
        item['low_temperature'] = low_temperature
        yield item
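Before running the full spider, the XPath expressions above can be sanity-checked interactively with `scrapy shell`; a quick session might look like this (output omitted, and the page structure may have changed since the post was written):

scrapy shell 'http://www.tianqi.com/chongqing/'
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()     # the 7 dates
>>> response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()     # icon URLs (protocol-relative)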

3. Storing the Data

3.1 Modify settings.py

# Add these two lines
MONGO_URI = 'localhost'
MONGO_DB = 'test'

# Modify the following directly
# (the integer is the pipeline priority: lower numbers run first,
#  so items pass through the pipelines in the order listed)
ITEM_PIPELINES = {
   'weather.pipelines.WeatherPipeline': 300,
   'weather.pipelines.W2json': 301,
   'weather.pipelines.MongoPipeline': 302,
   'weather.pipelines.W2mysql': 303,
}
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36'
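Because MONGO_URI and MONGO_DB are ordinary Scrapy settings, they can also be overridden for a single run with the -s flag, which is handy for pointing a test crawl at another database (the database name below is just an example):

scrapy crawl CQtianqi -s MONGO_DB=weather_test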

3.2 Data Storage

Modify pipelines.py

Storing to MongoDB

import pymongo    # needed at the top of pipelines.py


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings added to settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection is the class attribute defined in WeatherItem
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
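To confirm the documents actually landed in MongoDB, a quick check with pymongo (the database and collection names follow the settings and WeatherItem.collection above) might look like this:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['test']
for doc in db['weather'].find():
    print(doc['date'], doc['high_temperature'])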

Storing to MySQL

import pymysql    # needed at the top of pipelines.py


class W2mysql(object):
    def process_item(self, item, spider):
        '''
        Save the crawled data to MySQL.
        '''
        connection = pymysql.connect(host='localhost', user='root', password='xxx', db='scrapydb',
                                     charset='utf8mb4')
        try:
            with connection.cursor() as cursor:
                # one row per forecast day
                for i in range(7):
                    sql = "insert into `weather`(`date`,`week`,`high_temperature`,`low_temperature`,`weather`,`wind`,`img`)values(%s,%s,%s,%s,%s,%s,%s)"
                    cursor.execute(sql, (
                        item['date'][i], item['week'][i], item['high_temperature'][i], item['low_temperature'][i],
                        item['weather'][i],
                        item['wind'][i], item['img'][i]))

            connection.commit()
        # except pymysql.err.IntegrityError as e:
        #     print('duplicate row, do not insert it again!')
        finally:
            connection.close()
        return item
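The insert statement assumes a `weather` table already exists in the `scrapydb` database. The post does not show the schema, but a minimal table matching the seven columns used above could be created with a sketch like this (all columns as VARCHAR for simplicity; this is an assumption, not the author's original DDL):

import pymysql

connection = pymysql.connect(host='localhost', user='root', password='xxx',
                             db='scrapydb', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        # column names match the INSERT used by the W2mysql pipeline
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS `weather` (
                `date` VARCHAR(32),
                `week` VARCHAR(32),
                `high_temperature` VARCHAR(16),
                `low_temperature` VARCHAR(16),
                `weather` VARCHAR(64),
                `wind` VARCHAR(64),
                `img` VARCHAR(255)
            ) DEFAULT CHARSET=utf8mb4
        """)
    connection.commit()
finally:
    connection.close()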

Storing to txt

import os    # needed at the top of pipelines.py

# pathdir is not shown in the post; here it is assumed to be the directory
# containing pipelines.py, with a data/ folder created inside it
pathdir = os.path.dirname(os.path.abspath(__file__))


class WeatherPipeline(object):
    def process_item(self, item, spider):
        # the data is written to weather.txt inside the data directory
        filename = os.path.join(pathdir, 'data', 'weather.txt')
        # open the file in append mode and write one block per forecast day
        with open(filename, 'a', encoding='utf8') as f:
            for i in range(7):
                f.write('日期:' + item['date'][i] + '\n')
                f.write('星期:' + item['week'][i] + '\n')
                f.write('最高溫度:' + item['high_temperature'][i] + '\n')
                f.write('最低溫度:' + item['low_temperature'][i] + '\n')
                f.write('天氣:' + item['weather'][i] + '\n')
                f.write('風況:' + item['wind'][i] + '\n')
                f.write('-------------------------------------' + '\n')

        return item

Storing to JSON

import json    # needed at the top of pipelines.py


class W2json(object):
    def process_item(self, item, spider):
        '''
        Save the crawled data to JSON so it is easy to reuse later.
        '''
        filename = os.path.join(pathdir, 'data', 'weather.json')

        # append one json.dumps line per item; ensure_ascii=False is required,
        # otherwise the Chinese text would be stored as escape sequences
        # (e.g. \uXXXX) instead of readable characters
        with open(filename, 'a', encoding='utf8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)

        return item
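Since each item is dumped as one JSON object per line, the file can be read back line by line; a small sketch, assuming the same data/weather.json path:

import json

with open('data/weather.json', encoding='utf8') as f:
    for line in f:
        record = json.loads(line)
        print(record['date'], record['weather'])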

Running

cd into the weather project root, not the inner weather directory underneath it! Then run:

scrapy crawl CQtianqi
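As a side note, Scrapy's built-in feed exports can also dump the items directly without any custom pipeline, e.g.:

scrapy crawl CQtianqi -o weather_feed.json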

4. Results

Data stored to txt

Only part of the data is captured in the screenshot; in fact each record appears twice.

Data stored to JSON

This is not duplication; what is stored is the data for the two regions!

Data stored to MongoDB

This is not duplication; what is stored is the data for the two regions!

Data stored to MySQL

This is not duplication; what is stored is the data for the two regions!

Running in the terminal