Scraping City Weather Forecasts with the Scrapy Framework
- October 5, 2019
- Notes

[Today's vi Tip]
Jumping to a line in vi
vi l.py +5 opens the file at line 5 (handy for jumping straight to an error line); vi l.py + opens it at the last line.
Contents:
1. Project Initialization
2. Extracting the Data
  2.1 How It Works
  2.2 Data Extraction
  2.3 Custom Spider
3. Storing the Data
  3.1 Modify settings.py
  3.2 Data Storage
4. Results
5. Author's Note
1. Project Initialization
- Create the project
scrapy startproject weather
- Create the spider

scrapy genspider CqtianqiSpider tianqi.com

Since CqtianqiSpider would later have to be typed as scrapy crawl CqtianqiSpider and that name is long, change the spider's name attribute to CQtianqi, so the crawl command becomes scrapy crawl CQtianqi.
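For reference, after the rename the top of the spider file looks roughly like this (a trimmed sketch; the full spider is shown in section 2.3 below):

# -*- coding: utf-8 -*-
import scrapy


class CqtianqiSpider(scrapy.Spider):
    name = 'CQtianqi'                # renamed from the generated default
    allowed_domains = ['tianqi.com']
    start_urls = []                  # filled in later, see section 2.3

    def parse(self, response):
        pass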
2. Extracting the Data
2.1 How It Works

The goal this time is to extract the 7-day forecasts for Chongqing and Yanhu District. The relevant page source is shown in the screenshot above, which highlights the elements the spider needs to locate.
Next, define the fields to be stored:
date = date of the day
week = day of the week
img = weather icon for the day
wind = wind conditions for the day
weather = weather for the day
high_temperature = daily high temperature
low_temperature = daily low temperature
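Before writing the spider, the XPath selectors used in section 2.3 can be checked against the live page. A minimal sketch using requests and parsel (parsel ships with Scrapy; requests is an extra dependency here, and the page structure of tianqi.com may have changed since this was written):

import requests
from parsel import Selector

# Fetch the Chongqing forecast page and try the date/week selectors on it.
html = requests.get('http://www.tianqi.com/chongqing/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
sel = Selector(text=html)
print(sel.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract())
print(sel.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract())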
2.2 Data Extraction
Modify items.py
import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'weather'
    date = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    high_temperature = scrapy.Field()
    low_temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
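Note that collection = 'weather' is a plain class attribute rather than a Field; the MongoDB pipeline in section 3.2 reads it as item.collection to choose the collection name. A quick sanity check (a sketch):

from weather.items import WeatherItem

item = WeatherItem(date=['05'], week=['Saturday'])
print(dict(item))        # {'date': ['05'], 'week': ['Saturday']}
print(item.collection)   # 'weather', used later by MongoPipeline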
2.3 Custom Spider
CQtianqi.py
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem


class CqtianqiSpider(scrapy.Spider):
    name = 'CQtianqi'
    allowed_domains = ['tianqi.com']
    start_urls = []
    citys = ['chongqing', 'yanhuqu']
    for city in citys:
        start_urls.append('http://' + 'www.tianqi.com/' + city + '/')

    def parse(self, response):
        '''
        date = date of the day
        week = day of the week
        img = weather icon for the day
        wind = wind conditions for the day
        weather = weather for the day
        high_temperature = daily high temperature
        low_temperature = daily low temperature
        :param response:
        :return:
        '''
        # oneweek = response.xpath('//div[@class="day7"]')
        item = WeatherItem()
        date = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//b/text()').extract()
        week = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//span/text()').extract()
        base_url = 'http:'
        img = response.xpath('//div[@class="day7"]//ul[@class="week"]//li//img/@src').extract()
        imgs = []
        for i in range(7):
            img_i = img[i]
            img_url = base_url + img_i
            imgs.append(img_url)
        print(date)
        print(week)
        print(imgs)
        weather = response.xpath('//div[@class="day7"]//ul[@class="txt txt2"]//li/text()').extract()
        print(weather)
        high_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/span/text()').extract()
        low_temperature = response.xpath('//div[@class="day7"]//div[@class="zxt_shuju"]/ul//li/b/text()').extract()
        print(high_temperature)
        print(low_temperature)
        wind = response.xpath('//div[@class="day7"]//ul[@class="txt"][1]//li/text()').extract()
        print(wind)
        item['date'] = date
        item['week'] = week
        item['img'] = imgs
        item['weather'] = weather
        item['wind'] = wind
        item['high_temperature'] = high_temperature
        item['low_temperature'] = low_temperature
        yield item
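As written, parse is called once per start URL (one per city) and packs all seven days of each city into a single item as parallel lists; this is why every storage backend later holds two records, one per region. If one item per day were preferred instead, the same lists could simply be zipped. A sketch of that variation (not what the code above does, and the range(7) loops in the pipelines would have to change accordingly):

# Alternative ending for parse(): emit one item per day instead of list-valued fields.
for d, w, ic, wea, hi, lo, wd in zip(date, week, imgs, weather,
                                     high_temperature, low_temperature, wind):
    yield WeatherItem(date=d, week=w, img=ic, weather=wea,
                      high_temperature=hi, low_temperature=lo, wind=wd)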
3. Storing the Data
3.1 Modify settings.py
# Add these two lines
MONGO_URI = 'localhost'
MONGO_DB = 'test'

# Modify the following (a lower number means the pipeline runs earlier)
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.W2json': 301,
    'weather.pipelines.MongoPipeline': 302,
    'weather.pipelines.W2mysql': 303,
}
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36'
3.2 Data Storage
Modify pipelines.py
Store to MongoDB
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings added to settings.py above.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
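Note that Collection.insert() works with pymongo 3.x but is deprecated there and was removed in pymongo 4; if you are on a newer pymongo, the write line can be replaced like this:

    def process_item(self, item, spider):
        # insert_one is the current pymongo API for single-document writes
        self.db[item.collection].insert_one(dict(item))
        return item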
Store to MySQL
import pymysql


class W2mysql(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data to MySQL
        '''
        connection = pymysql.connect(host='localhost', user='root', password='xxx',
                                     db='scrapydb', charset='utf8mb4')
        try:
            with connection.cursor() as cursor:
                for i in range(7):
                    sql = ("insert into `weather`(`date`,`week`,`high_temperature`,"
                           "`low_temperature`,`weather`,`wind`,`img`)"
                           "values(%s,%s,%s,%s,%s,%s,%s)")
                    cursor.execute(sql, (item['date'][i], item['week'][i],
                                         item['high_temperature'][i],
                                         item['low_temperature'][i],
                                         item['weather'][i], item['wind'][i],
                                         item['img'][i]))
                connection.commit()
        # except pymysql.err.IntegrityError as e:
        #     print('Duplicate data, do not insert again!')
        finally:
            connection.close()
        return item
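This assumes a weather table already exists in the scrapydb database. The original post does not show the table definition; a minimal schema that matches the insert above might look like the following one-off setup sketch (column types are assumptions):

import pymysql

# Hypothetical setup script: create the table the pipeline inserts into.
connection = pymysql.connect(host='localhost', user='root', password='xxx',
                             db='scrapydb', charset='utf8mb4')
with connection.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS `weather` (
            `id` INT AUTO_INCREMENT PRIMARY KEY,
            `date` VARCHAR(32),
            `week` VARCHAR(32),
            `high_temperature` VARCHAR(16),
            `low_temperature` VARCHAR(16),
            `weather` VARCHAR(64),
            `wind` VARCHAR(64),
            `img` VARCHAR(255)
        ) DEFAULT CHARSET=utf8mb4
    """)
connection.commit()
connection.close()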
Store to txt
class WeatherPipeline(object):
    def process_item(self, item, spider):
        # Save the data to weather.txt under the data directory
        # (pathdir is the project path; see the sketch below for one way to define it).
        filename = pathdir + '\\data\\weather.txt'
        # Open the file in append mode and write each day's fields.
        with open(filename, 'a', encoding='utf8') as f:
            for i in range(7):
                f.write('Date: ' + item['date'][i] + '\n')
                f.write('Week: ' + item['week'][i] + '\n')
                f.write('High temperature: ' + item['high_temperature'][i] + '\n')
                f.write('Low temperature: ' + item['low_temperature'][i] + '\n')
                f.write('Weather: ' + item['weather'][i] + '\n')
                f.write('Wind: ' + item['wind'][i] + '\n')
                f.write('-------------------------------------' + '\n')
        return item
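Both this pipeline and the JSON pipeline below rely on a module-level pathdir variable that the snippets do not define. A minimal sketch of how it could be set near the top of pipelines.py (an assumption, not shown in the original post; the data directory must also exist, and the backslash paths assume Windows):

import os

# Assumed: point pathdir at the project directory, so that
# pathdir + '\\data\\weather.txt' resolves to <project>\data\weather.txt.
pathdir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))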
Store to JSON
import json


class W2json(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data to JSON for easy reuse later
        '''
        filename = pathdir + '\\data\\weather.json'
        # Open the JSON file and append each item with json.dumps.
        # ensure_ascii=False is needed, otherwise non-ASCII text is stored
        # as escape sequences such as "\xe15".
        with open(filename, 'a', encoding='utf8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)
        return item
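Since each item is written as one JSON object per line (JSON Lines style), the file can be read back line by line. A quick sketch, assuming the file lives at data/weather.json relative to where you run it:

import json

# Read the JSON Lines file written by W2json back into Python dicts.
with open('data/weather.json', encoding='utf8') as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))          # number of items written (one per city per crawl)
print(records[0]['date'])    # e.g. the list of dates for the first region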
Run
Make sure you are in the weather project root directory, not the inner weather package directory! Then run:
scrapy crawl CQtianqi
4. Results
Data stored to txt
Only part of the data is captured here; in fact each entry is repeated twice.

Data stored to JSON
This is not duplication; what is stored is the data for the two regions.

Data stored to MongoDB
This is not duplication; what is stored is the data for the two regions.

Data stored to MySQL
This is not duplication; what is stored is the data for the two regions.

Terminal output
