Python Scrapy Crawler Framework | 5. Storing Scraped Data in MongoDB with pipelines and settings

December 31, 2019
Notes

0x00 Preface

The previous post covered exporting scraped data to a file. This post builds on that code and stores the data in MongoDB instead.

0x01 Configuring pipelines.py

First, open pipelines.py; this is where the code for connecting to and writing to the database goes.

Import the package needed to talk to MongoDB:

import pymongo

Next, define the following methods. Note that all of them belong to the TeamssixPipeline class:

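class TeamssixPipeline:

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters defined in settings.py
        cls.DB_URL = crawler.settings.get('MONGO_DB_URI')
        cls.DB_NAME = crawler.settings.get('MONGO_DB_NAME')
        return cls()

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient(self.DB_URL)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # Store each item in a collection named after the spider
        collection = self.db[spider.name]
        collection.insert_one(dict(item))
        return item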
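As a point of comparison, here is a minimal sketch of the same pipeline that keeps the connection parameters on the instance rather than on the class. This is not the code from this post: the class name MongoPipeline and the attribute names mongo_uri and mongo_db are assumptions chosen for illustration, and pymongo is imported lazily inside open_spider so the class can be defined even where pymongo is not installed.

```python
class MongoPipeline:
    """Illustrative variant: settings are stored per instance (names are assumptions)."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Same setting names as in settings.py in this post
        return cls(
            mongo_uri=crawler.settings.get('MONGO_DB_URI'),
            mongo_db=crawler.settings.get('MONGO_DB_NAME'),
        )

    def open_spider(self, spider):
        # Lazy import: pymongo is only needed once a spider actually runs
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider, exactly as in the original pipeline
        self.db[spider.name].insert_one(dict(item))
        return item
```

Passing the values through the constructor means two pipeline instances cannot overwrite each other's settings, at the cost of a slightly longer class.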

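0x02 Configuring settings.py

ITEM_PIPELINES already exists in the generated settings.py, so just remove the comment markers in front of it, then add the two MongoDB settings after it: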

ITEM_PIPELINES = {
    'teamssix.pipelines.TeamssixPipeline': 300,  # priority, customarily 0-1000; lower values run first
}
MONGO_DB_URI = 'mongodb://localhost:27017'  # MongoDB connection URI
MONGO_DB_NAME = 'blog'  # name of the database to use
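To make the priority numbers concrete, the sketch below sorts an ITEM_PIPELINES dictionary the way the ordering rule describes: items pass through the lowest value first. The second entry, CleanupPipeline, is hypothetical and is not part of this project; it is there only so that there is something to order.

```python
ITEM_PIPELINES = {
    'teamssix.pipelines.TeamssixPipeline': 300,
    'teamssix.pipelines.CleanupPipeline': 100,  # hypothetical second pipeline
}

# Lower number = earlier in the chain
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

With these values the hypothetical cleanup step would see each item before it is written to MongoDB.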

0x03 Running

Run the crawl command directly, with no extra arguments:

scrapy crawl blogurl

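Note that if the target database does not yet exist, MongoDB creates it automatically when the first item is inserted, so there is no need to create it by hand. Opening the database in Robo 3T then shows the data that was just stored.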
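If a GUI is not at hand, the stored documents can also be checked from Python. The following is a rough sketch, not part of the original project: the helper name fetch_saved_items is made up for illustration, and main() assumes pymongo is installed and MongoDB is listening on localhost:27017. The database name blog and the collection name blogurl come from the settings and the spider name used above.

```python
def fetch_saved_items(client, db_name, collection_name, limit=5):
    """Return up to `limit` documents from one collection as plain dicts."""
    collection = client[db_name][collection_name]
    return [dict(doc) for doc in collection.find().limit(limit)]


def main():
    # Assumption: pymongo is installed and a local MongoDB server is running
    import pymongo

    client = pymongo.MongoClient('mongodb://localhost:27017')
    try:
        for doc in fetch_saved_items(client, 'blog', 'blogurl'):
            print(doc)
    finally:
        client.close()
```

Calling main() after a crawl should print the first few stored items; the helper takes the client as an argument so it can be reused with any open connection.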

References:
https://youtu.be/aDwAmj3VWH4
http://doc.scrapy.org/en/latest/topics/architecture.html
https://lemmo.xyz/post/Scrapy-To-MongoDB-By-Pipeline.html
