Systematic Python Crawler Learning (5)
Multithreaded crawlers: in my earlier study of network programming, I used multithreaded sockets to let a single server handle connections from multiple clients. Applying the same multithreading techniques to a crawler can greatly improve its efficiency.
A Python multithreaded crawler consists of three main parts: creating the threads, defining the thread class, and calling the crawler function inside the thread.
Creating the threads: this is usually done in a for loop; thread.start() starts a thread, and thread.join() blocks until that thread has finished.
Sample code:
thread_list = []
for i in range(1, 6):
    thread = MyThread("thread" + str(i), link_list[i-1])
    thread.start()
    thread_list.append(thread)
for thread in thread_list:
    thread.join()
Defining the thread class: the thread is defined through inheritance. The class typically contains two methods: __init__, which is called automatically when the instance is created, and run, which is called automatically when thread.start() is executed. Sample code:
class MyThread(threading.Thread):
    def __init__(self, name, link_s):
        threading.Thread.__init__(self)
        self.name = name
        self.links = link_s   # store the links this thread is responsible for
    def run(self):
        print('%s is in Process:' % self.name)
        # spider() is the crawler function that does the actual work
        spider(self.name, self.links)
        print('%s is out Process' % self.name)
The crawler function is called inside run(). The key point of a multithreaded crawler is binding the threads tightly to the crawler function, which means distributing the work among the crawlers, i.e. deciding which pages each thread should fetch.
First I wrote a small script that writes the URLs of pages 1-300 of Beike Zhaofang's Nanjing rental listings into a.txt:
zurl="//nj.zu.ke.com/zufang/pg" for i in range(101,300): turl=url+str(i)+'\n' print(turl) with open ('a.txt','a+') as f: f.write(turl)
Next, in the main program, read these links into a list:
link_list = []
with open('a.txt', "r") as f:
    file_list = f.readlines()
    for i in file_list:
        i = re.sub('\n', '', i)
        link_list.append(i)
After that, link_list[i] can be used to assign a different task to each crawler thread.
max = len(link_list)   # max is the total number of pages
page = 0               # page is the index of the current page

def spider(threadName, link_range):
    global page
    global max
    global num
    while page < max:
        i = page
        page += 1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                num += 1
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info = "page:" + str(i) + "num:" + str(num) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)
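Note that the link_range parameter above is not actually used; the threads coordinate through the shared global page counter instead. A minimal sketch of the alternative, in which each thread owns its own slice of link_list (the split_links helper below is hypothetical, not part of the original code):

def split_links(links, n):
    # hypothetical helper: split links into n roughly equal consecutive chunks, one per thread
    size = (len(links) + n - 1) // n
    return [links[i:i + size] for i in range(0, len(links), size)]

# each thread would then receive its own chunk:
# chunks = split_links(link_list, 5)
# thread = MyThread("thread" + str(i), chunks[i - 1])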
With this, the threads can fetch data concurrently. The complete code is as follows:
#coding=utf-8
import re
import requests
import threading
import time
from bs4 import BeautifulSoup

page = 0   # index of the current page
num = 0    # number of listings crawled so far
link_list = []
with open('a.txt', "r") as f:
    file_list = f.readlines()
    for i in file_list:
        i = re.sub('\n', '', i)
        link_list.append(i)
max = len(link_list)   # total number of pages
print(max)

class MyThread(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name
    def run(self):
        print('%s is in Process:' % self.name)
        spider(self.name)
        print('%s is out Process' % self.name)

def spider(threadName):
    global page
    global max
    global num
    while page < max:
        i = page
        page += 1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                num += 1
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info = "page:" + str(i) + "num:" + str(num) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)

start = time.time()
thread_list = []
for i in range(1, 6):
    thread = MyThread("thread" + str(i))
    thread.start()
    thread_list.append(thread)
for thread in thread_list:
    thread.join()
end = time.time()
print("All using time:", end - start)
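One caveat: page and num are shared by all threads and page += 1 is not atomic, so two threads can occasionally grab the same index. A minimal sketch of how the counter could be protected with threading.Lock (my own addition, not part of the original code):

import threading

page_lock = threading.Lock()

def next_page():
    # hypothetical helper: atomically reserve the next page index, or return None when done
    global page
    with page_lock:
        if page >= max:
            return None
        i = page
        page += 1
        return i

spider() would then call next_page() instead of reading and incrementing page directly.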
In addition, a multithreaded crawler can be combined with a queue to produce a full-speed crawler, which is a bit faster still. The complete code is as follows:
#coding:utf-8
import threading
import time
import re
import requests
import queue as Queue

link_list = []
with open('a.txt', 'r') as f:
    file_list = f.readlines()
    for each in file_list:
        each = re.sub('\n', '', each)
        link_list.append(each)

class MyThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q
    def run(self):
        print("%s is start " % self.name)
        crawel(self.name, self.q)
        print("%s is end " % self.name)

def crawel(threadname, q):
    while not q.empty():
        temp_url = q.get(timeout=1)
        try:
            r = requests.get(temp_url, timeout=20)
            print(threadname, r.status_code, temp_url)
        except Exception as e:
            print("Error", e)
            pass

if __name__ == '__main__':
    start = time.time()
    thread_list = []
    thread_Name = ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5']
    workQueue = Queue.Queue(1000)
    # fill the queue
    for url in link_list:
        workQueue.put(url)
    # create the threads
    for tname in thread_Name:
        thread = MyThread(tname, workQueue)
        thread.start()
        thread_list.append(thread)
    for t in thread_list:
        t.join()
    end = time.time()
    print("All using time:", end - start)
    print("Exiting Main Thread")
A queue-based crawler needs the queue library. Besides the threading knowledge, we also need the queue concepts that go with it. The key queue-related pieces in the code above are: creating and filling the queue, passing the queue to the threads, and consuming the queue until it is empty:
1⃣️: Creating and filling the queue:
workQueue = Queue.Queue(1000)
# fill the queue
for url in link_list:
    workQueue.put(url)
2⃣️: Passing the queue to the threads:
thread=MyThread(tname,workQueue)
3⃣️: Consuming the queue until it is empty:
def crawel(threadname, q):
    while not q.empty():
        ...   # keep taking URLs from the queue until it is drained
The idea behind the queue is first in, first out: once the queue has been drained, the threads finish.
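One caveat: q.empty() followed by q.get() is not atomic, so with several threads a q.get(timeout=1) can still raise queue.Empty after another thread has taken the last URL. A minimal, more defensive version of the loop (my own variation, not the original code):

import queue
import requests

def crawel(threadname, q):
    while True:
        try:
            temp_url = q.get(timeout=1)   # give up after waiting one second
        except queue.Empty:
            break                         # queue drained: let the thread exit
        try:
            r = requests.get(temp_url, timeout=20)
            print(threadname, r.status_code, temp_url)
        except Exception as e:
            print("Error", e)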
Multi-process crawlers: generally there are two ways to build one: plain multiprocessing, and Pool + Queue.
multiprocessing is used in much the same way as threading; only a few pieces of code need to be replaced, namely the definition and initialization of the process, and ending the child processes along with the parent (a combined sketch follows these two points).
1⃣️: Defining and initializing the process:
class Myprocess(Process):
    def __init__(self):
        Process.__init__(self)
2⃣️: Ending the child processes with the parent: once this is set, a child process is terminated automatically when its parent process exits:
p.daemon=True
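The post only shows these fragments; below is a minimal sketch of how they might fit together with multiprocessing.Process, assuming the same a.txt file of URLs as in the threaded version (the crawler function and other names are illustrative, not from the original post):

from multiprocessing import Process, Queue
import queue
import requests

def crawler(q, index):
    # hypothetical worker: drain URLs from the shared queue until it is empty
    while True:
        try:
            url = q.get(timeout=1)
        except queue.Empty:
            break
        try:
            r = requests.get(url, timeout=20)
            print("Process-%d" % index, r.status_code, url)
        except Exception as e:
            print("Error", e)

class Myprocess(Process):
    def __init__(self, q, index):
        Process.__init__(self)
        self.q = q
        self.index = index
    def run(self):
        crawler(self.q, self.index)

if __name__ == '__main__':
    with open('a.txt', 'r') as f:
        link_list = [line.strip() for line in f.readlines()]
    workQueue = Queue(1000)
    for url in link_list:
        workQueue.put(url)
    process_list = []
    for i in range(1, 4):
        p = Myprocess(workQueue, i)
        p.daemon = True   # child exits automatically when the parent ends
        p.start()
        process_list.append(p)
    for p in process_list:
        p.join()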
The other approach is to combine Manager and Pool:
manager = Manager()
workQueue = manager.Queue(1000)
for url in link_list:
    workQueue.put(url)
pool = Pool(processes=3)
for i in range(1, 5):
    pool.apply_async(crawler, args=(workQueue, i))
pool.close()
pool.join()
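For completeness, a minimal runnable sketch of the Manager + Pool variant, again assuming the same a.txt file of URLs; the crawler function here is my own filler, since the original post does not show it:

from multiprocessing import Manager, Pool
import queue
import requests

def crawler(q, index):
    # hypothetical worker: each pooled task drains URLs from the managed queue
    while True:
        try:
            url = q.get(timeout=1)
        except queue.Empty:
            break
        try:
            r = requests.get(url, timeout=20)
            print("Pool-%d" % index, r.status_code, url)
        except Exception as e:
            print("Error", e)

if __name__ == '__main__':
    with open('a.txt', 'r') as f:
        link_list = [line.strip() for line in f.readlines()]
    manager = Manager()
    workQueue = manager.Queue(1000)   # a managed queue proxy can be shared with pooled workers
    for url in link_list:
        workQueue.put(url)
    pool = Pool(processes=3)
    for i in range(1, 5):
        pool.apply_async(crawler, args=(workQueue, i))
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for all pooled tasks to finish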