Scrapy crawler framework

April 11, 2020
Notes

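Architecture overview
Installation, project creation, and startup
Configuration files and directory layout
Crawling and parsing data
Data persistence
- Saving to a file
- Saving to Redis
Action chains: handling slider CAPTCHAs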

Architecture overview

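Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more broadly: for data mining, monitoring, and automated testing, for fetching data returned by APIs (such as Amazon Associates Web Services), and as a general-purpose web crawler.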

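Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy achieves concurrency with non-blocking (asynchronous) code. The main components of the architecture are listed below.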

The underlying mechanism is I/O multiplexing.
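To make the idea of non-blocking concurrency concrete, here is a minimal sketch. It uses the standard-library asyncio rather than Twisted, so it is only an analogy for how an event loop interleaves waiting tasks; the `fake_download` helper, the page names, and the delays are all made up for illustration.

```python
import asyncio


async def fake_download(url: str, delay: float, finished: list) -> str:
    """Pretend to download a page; asyncio.sleep stands in for network wait."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking
    finished.append(url)        # record the order in which downloads complete
    return f"<html>{url}</html>"


async def crawl(jobs: dict) -> tuple:
    """Start all downloads at once and wait for every one of them."""
    finished = []
    pages = await asyncio.gather(
        *(fake_download(url, delay, finished) for url, delay in jobs.items())
    )
    return pages, finished


# Hypothetical pages with different simulated latencies (seconds).
jobs = {"page-a": 0.3, "page-b": 0.1, "page-c": 0.2}
pages, finished = asyncio.run(crawl(jobs))

print(finished)  # completion order follows latency, not submission order
print(len(pages))
```

All three downloads wait at the same time in a single thread, so the total time is close to the slowest download rather than the sum of all three.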

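# Engine (ENGINE), the "general manager"
The engine controls the data flow between all components of the system and triggers events when certain actions occur. For details, see the data flow section of the Scrapy documentation.

# Scheduler (SCHEDULER)
Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL is crawled next and also removes duplicate URLs.

# Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

# Spiders (SPIDERS)
Spiders are classes written by the developer. They parse responses and either extract items or send new requests.

# Item pipelines (ITEM PIPELINES)
Process items after they have been extracted. Typical tasks are cleaning, validation, and persistence (for example, storing to a database).

# Two kinds of middleware
- Spider middleware
- Downloader middleware (used most often: adding headers, proxies, and cookies, or integrating Selenium)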
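The interaction between the components above can be sketched with a toy model. This is not Scrapy's implementation: `ToyScheduler`, `toy_engine`, the `SITE` dictionary, and the convention that a lower number is served first are all assumptions made for this example. It only shows the loop of scheduling, downloading, parsing, and de-duplicating.

```python
import heapq
from itertools import count


class ToyScheduler:
    """Priority queue of URLs that drops duplicates (toy stand-in for SCHEDULER)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal priorities keep insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:      # duplicate URL: ignore it
            return False
        self._seen.add(url)
        # Toy convention: the lowest priority number is popped first.
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def toy_engine(start_urls, download, parse):
    """Drive scheduler -> downloader -> spider until the queue is empty."""
    scheduler = ToyScheduler()
    for url in start_urls:
        scheduler.enqueue(url)
    items = []
    while (url := scheduler.next_url()) is not None:
        response = download(url)
        for result in parse(response):
            if isinstance(result, str):   # a string is a new URL to follow
                scheduler.enqueue(result)
            else:                         # anything else is an extracted item
                items.append(result)
    return items


# A hypothetical three-page site whose pages link to each other.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": []}


def download(url):
    return {"url": url, "links": SITE[url]}


def parse(response):
    yield {"page": response["url"]}   # one item per page
    yield from response["links"]      # plus the links to follow


items = toy_engine(["/"], download, parse)
print(items)
```

Even though `/a` links back to `/` and `/b` is linked twice, each page is visited exactly once because the scheduler remembers what it has already queued.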

Installation, project creation, and startup

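# 1 Scrapy is a framework, not a module.
# 2 It is often called the Django of the crawler world (you will notice many similarities with Django).
# 3 Installation
    - macOS, Linux: pip3 install scrapy
    - Windows: pip3 install scrapy (works for most people)
        - If that fails:
            1. pip3 install wheel   # enables installing packages from wheel files; wheel files: https://www.lfd.uci.edu/~gohlke/pythonlibs
            2. pip3 install lxml
            3. pip3 install pyopenssl
            4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
            5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
            6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
            7. pip3 install scrapy
# 4 After installation, the Scripts folder contains the scrapy.exe executable
    - Create a Scrapy project: scrapy startproject <project name>   (comparable to creating a Django project)
    - Create a spider: scrapy genspider <spider name> <site to crawl>   # you can create several spiders
# 5 Start a spider from the command line
    - scrapy crawl <spider name>
    - scrapy crawl <spider name> --nolog   # start without log output
# 6 Start a spider from a file (recommended)
    - Create main.py in the project directory, then right-click and run it

    # main.py
    from scrapy.cmdline import execute

    # execute(['scrapy', 'crawl', 'chouti', '--nolog'])  # use this when no log level is configured
    execute(['scrapy', 'crawl', 'chouti'])               # use this when LOG_LEVEL is set in settings.py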

Configuration files and directory layout

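-crawl_chouti                # project name
    -crawl_chouti            # folder with the same name as the project
        -spiders             # spiders generated by genspider all go here
            -__init__.py
            -chouti.py       # chouti spider
            -cnblogs.py      # cnblogs spider
        -items.py            # comparable to models.py in Django; define model classes here
        -middlewares.py      # middlewares (spider and downloader) are written here
        -pipelines.py        # persistence code (file, MySQL, Redis, MongoDB)
        -settings.py         # configuration file
    -scrapy.cfg              # deployment related; can be ignored for now

# settings.py
    ROBOTSTXT_OBEY = False   # whether to obey robots.txt; False crawls regardless
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'   # User-Agent request header; copy it from a browser or pick one from a UA pool
    LOG_LEVEL = 'ERROR'      # with this setting only error messages are printed
    # With LOG_LEVEL set, a plain "scrapy crawl <spider name>" already runs without log noise,
    # so "scrapy crawl <spider name> --nolog" is no longer needed.

# Spider file
    import scrapy

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'                           # spider name
        allowed_domains = ['dig.chouti.com']      # domains (not URLs) the spider may crawl; comment out to crawl beyond them
        start_urls = ['https://dig.chouti.com/']  # starting point; requested as soon as the spider starts

        def parse(self, response):                # called automatically when a response comes back; parse it here
            print('---------------------------', response)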
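The USER_AGENT comment mentions taking a value from a UA pool. One way this might look is a small downloader middleware that picks a random User-Agent for each request. This is a sketch under assumptions: the class name `RandomUserAgentMiddleware`, the second User-Agent string, and the `FakeRequest` stand-in used for the demonstration are not from the original notes.

```python
import random

# Illustrative pool; the first entry is the UA from settings.py above,
# the second is a made-up example of another browser string.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # None lets the request continue through the download chain


class FakeRequest:
    """Minimal stand-in for a request object, used only to try the middleware."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
result = RandomUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```

In a real project the class would live in middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py, for example with a key such as 'crawl_chouti.middlewares.RandomUserAgentMiddleware' (the exact path depends on the project name).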

Crawling and parsing data

# 1 Parsing with bs4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(response.text, 'lxml')
    soup.find_all()   # bs4 search
    soup.select()     # CSS selectors

# 2 Built-in selectors
    response.css
    response.xpath

    # Everything selected with css or xpath comes back as a list
    # First match:  extract_first()
    # All matches:  extract()

    # CSS selectors, text and attributes:
    #   .link-title::text         # text; the value is in data
    #   .link-title::attr(href)   # attribute; the value is in data
    # XPath selectors, text and attributes:
    #   .//a[contains(@class,"link-title")]/text()
    #   .//a[contains(@class,"link-title")]/@href

    # Built-in CSS selector, take all matches
    div_list = response.css('.link-con .link-item')
    for div in div_list:
        content = div.css('.link-title').extract()
        print(content)
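The difference between selecting nodes, taking their text, and taking an attribute can be tried without a running spider. The sketch below uses the standard-library ElementTree as a stand-in, not Scrapy's selectors, and the HTML snippet is invented for the example. It supports only exact attribute matches, so it uses `@class='link-title'` where the XPath above uses `contains()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like the list being scraped.
HTML = """
<div class="link-con">
  <div class="link-item"><a class="link-title" href="/post/1">First post</a></div>
  <div class="link-item"><a class="link-title" href="/post/2">Second post</a></div>
</div>
""".strip()

root = ET.fromstring(HTML)

# Select every title link inside a .link-item block.
links = root.findall(".//div[@class='link-item']/a[@class='link-title']")

titles = [a.text for a in links]        # comparable to ::text with extract()
hrefs = [a.get("href") for a in links]  # comparable to ::attr(href) with extract()
first_title = titles[0] if titles else None  # comparable to extract_first()

print(titles)
print(hrefs)
print(first_title)
```

As with `extract()` and `extract_first()`, a selection always yields a list, and taking the first match has to cope with the list being empty.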

Data persistence

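# Option 1 (not recommended)
    - 1 The parse function returns a list of dictionaries
    - 2 Run the command below; the data is written to aa.json
        scrapy crawl chouti -o aa.json
        # supported formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

    # Code (inside parse):
    lis = []
    for div in div_list:
        content = div.select('.link-title')[0].text
        lis.append({'title': content})
    return lis

# Option 2: pipelines
    - 1 Create a model class in items.py
    - 2 Import it in the spider (chouti.py) and put the parsed data into an item object (use square brackets, e.g. item['title'])
    - 3 yield the item object
    - 4 Register the pipelines in the settings file

        ITEM_PIPELINES = {
            # the number is the priority (the smaller the number, the higher the priority)
            'crawl_chouti.pipelines.CrawlChoutiPipeline': 300,
            'crawl_chouti.pipelines.CrawlChoutiRedisPipeline': 301,
        }

    - 5 Write the persistence classes in pipelines.py
        open_spider    # method: open the file once at the start
        process_item   # method: write to the file
        close_spider   # method: close the file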
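The lifecycle and ordering described in option 2 can be simulated in plain Python. The runner below is a toy, not Scrapy's engine: `run_pipelines`, `UppercasePipeline`, and `CollectPipeline` are hypothetical names. It only illustrates that pipelines run in ascending priority order and that each `process_item` must return the item for the next pipeline to receive it.

```python
class UppercasePipeline:
    """Cleans the item; meant to run first (priority 300)."""

    def open_spider(self, spider):
        self.seen = 0

    def process_item(self, item, spider):
        self.seen += 1
        item["title"] = item["title"].upper()
        return item  # returning the item hands it to the next pipeline

    def close_spider(self, spider):
        pass


class CollectPipeline:
    """Stores the item; meant to run second (priority 301)."""

    def open_spider(self, spider):
        self.rows = []
        self.closed = False

    def process_item(self, item, spider):
        self.rows.append(dict(item))
        return item

    def close_spider(self, spider):
        self.closed = True


def run_pipelines(pipelines_by_priority, items, spider=None):
    """Toy runner: open all, pass each item through in priority order, close all."""
    ordered = sorted(pipelines_by_priority, key=pipelines_by_priority.get)
    for pipeline in ordered:
        pipeline.open_spider(spider)
    for item in items:
        for pipeline in ordered:
            item = pipeline.process_item(item, spider)
    for pipeline in ordered:
        pipeline.close_spider(spider)


upper = UppercasePipeline()
collect = CollectPipeline()
# Registration order does not matter; the numbers decide who runs first.
run_pipelines({collect: 301, upper: 300}, [{"title": "hello"}, {"title": "world"}])
print(collect.rows)
```

Because 300 sorts before 301, the collecting pipeline already sees the upper-cased titles.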

Saving to a file

# choutiaa.py (spider file)
    import scrapy
    from bs4 import BeautifulSoup
    from chouti.items import ChoutiItem  # import the model class

    class ChoutiaaSpider(scrapy.Spider):
        name = 'choutiaa'
        # allowed_domains = ['dig.chouti.com']     # domains the spider may crawl
        start_urls = ['https://dig.chouti.com/']   # starting point

        # called automatically when a response comes back; parse it here
        def parse(self, response):
            print('----------------', response)
            soup = BeautifulSoup(response.text, 'lxml')
            div_list = soup.select('.link-con .link-item')

            for div in div_list:
                content = div.select('.link-title')[0].text
                href = div.select('.link-title')[0].attrs['href']
                item = ChoutiItem()        # create the model object
                item['content'] = content  # assign values
                item['href'] = href
                yield item                 # yield is required

# items.py (model classes)
    import scrapy

    class ChoutiItem(scrapy.Item):
        content = scrapy.Field()
        href = scrapy.Field()

# pipelines.py (persistence)
    class ChoutiPipeline(object):
        def open_spider(self, spider):
            # open the file once at the start
            self.f = open('a.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # print(item)
            # write to the file
            self.f.write(item['content'])
            self.f.write(item['href'])
            self.f.write('\n')
            return item

        def close_spider(self, spider):
            # writing is finished; close the file
            self.f.close()

# settings.py
    ITEM_PIPELINES = {
        # the number is the priority; smaller means higher priority
        'chouti.pipelines.ChoutiPipeline': 300,
        'chouti.pipelines.ChoutiRedisPipeline': 301,
    }

Saving to Redis

# settings.py
    ITEM_PIPELINES = {
        # the number is the priority; smaller means higher priority
        'chouti.pipelines.ChoutiPipeline': 300,
        'chouti.pipelines.ChoutiRedisPipeline': 301,
    }

# pipelines.py
    # save to Redis
    import json
    from redis import Redis

    class ChoutiRedisPipeline(object):
        def open_spider(self, spider):
            # without arguments the default connection settings are used
            self.conn = Redis(password='123')  # get the Redis connection once at the start

        def process_item(self, item, spider):
            print(item)
            s = json.dumps({'content': item['content'], 'href': item['href']})
            self.conn.hset('choudi_article', item['id'], s)
            return item

        def close_spider(self, spider):
            pass
            # self.conn.close()

# chouti.py
    import scrapy
    from bs4 import BeautifulSoup
    from chouti.items import ChoutiItem  # import the model class

    class ChoutiaaSpider(scrapy.Spider):
        name = 'choutiaa'
        # allowed_domains = ['dig.chouti.com']     # domains the spider may crawl
        start_urls = ['https://dig.chouti.com/']   # starting point

        # called automatically when a response comes back; parse it here
        def parse(self, response):
            print('----------------', response)
            soup = BeautifulSoup(response.text, 'lxml')
            div_list = soup.select('.link-con .link-item')

            for div in div_list:
                content = div.select('.link-title')[0].text
                href = div.select('.link-title')[0].attrs['href']
                data_id = div.attrs['data-id']
                item = ChoutiItem()        # create the model object
                item['content'] = content  # assign values
                item['href'] = href
                item['id'] = data_id       # requires id = scrapy.Field() in ChoutiItem (items.py)
                yield item                 # yield is required
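Storing each article in a hash field keyed by its id means that crawling the same article again overwrites the old entry instead of adding a duplicate. The sketch below shows that behavior with an in-memory stand-in so it runs without a Redis server. `FakeRedis` and `HashPipeline` are hypothetical names, and the stand-in implements only `hset` and `hget`; a real Redis client returns bytes from `hget` by default.

```python
import json


class FakeRedis:
    """In-memory stand-in exposing only hset/hget, backed by nested dicts."""

    def __init__(self):
        self.hashes = {}

    def hset(self, name, key, value):
        self.hashes.setdefault(name, {})[key] = value

    def hget(self, name, key):
        return self.hashes.get(name, {}).get(key)


class HashPipeline:
    """Pipeline sketch that receives its connection from outside."""

    def __init__(self, conn, hash_name="choudi_article"):
        self.conn = conn
        self.hash_name = hash_name

    def process_item(self, item, spider):
        payload = json.dumps({"content": item["content"], "href": item["href"]})
        # The article id is the hash field, so the same id overwrites itself.
        self.conn.hset(self.hash_name, item["id"], payload)
        return item


conn = FakeRedis()
pipeline = HashPipeline(conn)
pipeline.process_item({"id": "42", "content": "old title", "href": "/a"}, spider=None)
pipeline.process_item({"id": "42", "content": "new title", "href": "/a"}, spider=None)

stored = json.loads(conn.hget("choudi_article", "42"))
print(len(conn.hashes["choudi_article"]), stored["content"])
```

Passing the connection into the constructor is a design choice for this example only; it makes the pipeline easy to exercise with a fake, whereas the pipeline above creates its own connection in `open_spider`.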

Action chains: handling slider CAPTCHAs

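    # Note: executable_path and find_element_by_xpath are Selenium 3 style calls;
    # Selenium 4 uses find_element(By.XPATH, ...) instead.
    from selenium import webdriver
    from selenium.webdriver import ActionChains

    bro = webdriver.Chrome(executable_path='./chromedriver')
    bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    bro.implicitly_wait(10)

    # switch to the frame (rarely needed)
    bro.switch_to.frame('iframeResult')
    div = bro.find_element_by_xpath('//*[@id="draggable"]')

    # 1 Create an action chain object
    action = ActionChains(bro)
    # 2 Click and hold an element
    action.click_and_hold(div)
    # 3 Move (three options)
    # action.move_by_offset()                # by coordinates (x, y)
    # action.move_to_element()               # to another element
    # action.move_to_element_with_offset()   # to another element, plus an offset

    for i in range(5):
        action.move_by_offset(10, 10)

    # 4 Release the element (let go of the mouse); queued like the steps above
    action.release()
    # 5 Actually perform the queued actions; nothing happens until perform() is called
    action.perform()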
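For a slider, the loop of fixed 10-pixel moves is usually replaced by a list of offsets that add up to the required distance. The helper below is one possible way to build such a list, with larger steps first and smaller steps later. `build_track` and its weighting scheme are assumptions for illustration; nothing here claims to defeat any particular CAPTCHA.

```python
def build_track(distance: int, steps: int = 5) -> list:
    """Split `distance` pixels into `steps` integer offsets, roughly decelerating."""
    weights = list(range(steps, 0, -1))   # e.g. 5, 4, 3, 2, 1
    total = sum(weights)
    track = [distance * w // total for w in weights]
    track[-1] += distance - sum(track)    # add the rounding remainder to the last step
    return track


track = build_track(50)
print(track, sum(track))
```

Each value could then be fed to the action chain in place of the fixed offsets, for example `for x in track: action.move_by_offset(x, 0)`, before calling `release()` and `perform()`.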
