Systematic Python Web Crawler Study (5)

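Multithreaded crawlers. In my earlier network programming study I used multithreaded sockets to let one server handle connections from many clients. Applying multithreading to a crawler can likewise greatly improve its efficiency, since much of a crawler's time is spent waiting on network responses.

A Python multithreaded crawler consists of three main parts: creating the threads, defining the thread class, and calling the crawl function inside the thread.

Creating threads: threads are usually created in a for loop. thread.start() starts each thread, and thread.join() blocks the calling thread until that thread has finished.

Sample code: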

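thread_list=[]
for i in range(1,6):
    # link_batches (an assumed name) holds one batch of links per thread
    thread=MyThread("thread"+str(i),link_batches[i-1])
    thread.start()
    thread_list.append(thread)
# wait for all threads to finish
for thread in thread_list:
    thread.join()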

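Defining the thread: the thread class is defined through inheritance from threading.Thread. It usually contains two methods: __init__, the initializer that is called automatically when an instance is created, and run, which is called automatically in the new thread when thread.start() executes. Sample code: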

class MyThread(threading.Thread):
    def __init__(self,name,link_s):
        threading.Thread.__init__(self)
        self.name=name
        self.links=link_s  # keep the links so that run() can pass them on
    def run(self):
        print('%s is in Process:'%self.name)
        # spider is the crawl function; calling it here runs it in this thread
        spider(self.name,self.links)
        print('%s is out Process'%self.name)
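
To try the __init__ / run / start / join flow without touching the network, here is a minimal sketch. DemoThread, fake_spider, and the results list are hypothetical stand-ins rather than part of the crawler in this post, and the short sleep only simulates a slow request.

```python
import threading
import time

results = []                      # filled by the worker threads
results_lock = threading.Lock()   # guards results while threads append to it

def fake_spider(thread_name, links):
    """Hypothetical stand-in for the real spider: pretends to fetch each link."""
    for link in links:
        time.sleep(0.01)          # simulate network latency
        with results_lock:
            results.append((thread_name, link))

class DemoThread(threading.Thread):
    def __init__(self, name, links):
        threading.Thread.__init__(self, name=name)
        self.links = links        # keep the batch so that run() can use it

    def run(self):
        # start() arranges for run() to execute in the new thread
        fake_spider(self.name, self.links)

def run_demo():
    results.clear()
    batches = [["page1", "page2"], ["page3", "page4"]]
    threads = [DemoThread("thread" + str(i + 1), batch)
               for i, batch in enumerate(batches)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # block until that thread has finished
    return sorted(link for _, link in results)

print(run_demo())
```

Because join() is called only after every thread has been started, the two batches are processed at the same time rather than one after the other.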

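The crawl function is called inside run. The key to a multithreaded crawler is tying the threads and the crawl function together, which means distributing tasks among the crawlers, that is, deciding what each thread is responsible for crawling.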
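
One simple way to distribute the tasks is to cut the list of links into one batch per thread. The sketch below is only an illustration under that assumption: split_links is a hypothetical helper, and the "pg1" to "pg10" strings are placeholders for real page URLs.

```python
def split_links(links, n_threads):
    """Split links into n_threads contiguous batches of near-equal size."""
    base, extra = divmod(len(links), n_threads)
    batches = []
    start = 0
    for k in range(n_threads):
        size = base + (1 if k < extra else 0)   # the first `extra` batches get one more
        batches.append(links[start:start + size])
        start += size
    return batches

links = ["pg" + str(i) for i in range(1, 11)]   # placeholder links
batches = split_links(links, 3)
for k, batch in enumerate(batches, start=1):
    print("thread" + str(k), "gets", len(batch), "links")
```

Fixed batches are easy to reason about, but a thread that receives slow pages finishes last while the others sit idle; a shared counter or a queue avoids that by letting each thread take the next page as soon as it is free.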

First I wrote a script that writes the links for pages 1-300 of Nanjing rental listings on Beike (贝壳找房) to a.txt. The code is as follows:

url="https://nj.zu.ke.com/zufang/pg"
# open the file once and append one link per page, pages 1 to 300
with open('a.txt','a+') as f:
    for i in range(1,301):
        turl=url+str(i)+'\n'
        print(turl)
        f.write(turl)

Next, in the main program, these links are read into a list:

link_list=[]
with open('a.txt',"r") as f:
    file_list=f.readlines()
    for i in file_list:
        i=re.sub('\n','',i)
        link_list.append(i)

After that, link_list[i] can be used to give each crawler a different task:

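max=len(link_list) # max is the total number of pages
page=0 # page is the index of the next page to crawl
num=0 # num counts the listings printed so far
lock=threading.Lock() # guards the counters page and num shared by all threads
def spider(threadName, link_range):
    # link_range is not used here; the shared page counter hands out the work
    global page, num
    while True:
        # claim the next page under the lock so that no two threads take the same one
        with lock:
            if page>=max:
                break
            i = page
            page+=1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                with lock:
                    num += 1
                    n = num
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info ="page:"+str(i)+"num:" + str(n) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)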

With this, the threads can fetch information concurrently. The complete code is as follows:

#coding=utf-8
import re
import requests
import threading
import time
from bs4 import BeautifulSoup
page=0
num=0
link_list=[]
with open('a.txt',"r") as f:
    file_list=f.readlines()
    for i in file_list:
        i=re.sub('\n','',i)
        link_list.append(i)
max=len(link_list)
print(max)
class MyThread(threading.Thread):
    def __init__(self,name):
        threading.Thread.__init__(self)
        self.name=name
    def run(self):
        print('%s is in Process:'%self.name)
        spider(self.name)
        print('%s is out Process'%self.name)
max=len(link_list) # max is the total number of pages
page=0 # page is the index of the next page to crawl
lock=threading.Lock() # guards the counters page and num shared by all threads
def spider(threadName):
    global page, num
    while True:
        # claim the next page under the lock so that no two threads take the same one
        with lock:
            if page>=max:
                break
            i = page
            page+=1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                with lock:
                    num += 1
                    n = num
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info ="page:"+str(i)+"num:" + str(n) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)
start = time.time()
thread_list=[] # keep the threads so that they can be joined below
for i in range(1,6):
    thread=MyThread("thread"+str(i))
    thread.start()
    thread_list.append(thread)
for thread in thread_list:
    thread.join()
end=time.time()
print("All using time:",end-start)
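
The shared page counter decides which thread crawls which page, so it is worth looking at in isolation. The following is a minimal sketch, not the code of this post: PageCounter, worker, and run_workers are hypothetical names, and no network request is made. It shows one way to hand out each index exactly once by doing the bounds check and the increment under a single threading.Lock.

```python
import threading

class PageCounter:
    """Hands out page indices 0..total-1, each exactly once."""
    def __init__(self, total):
        self.total = total
        self.next_page = 0
        self.lock = threading.Lock()

    def claim(self):
        # The check and the increment happen under one lock, so two threads
        # can never receive the same index or run past the end of the list.
        with self.lock:
            if self.next_page >= self.total:
                return None
            i = self.next_page
            self.next_page += 1
            return i

def worker(counter, claimed):
    while True:
        i = counter.claim()
        if i is None:
            break
        claimed.append(i)      # a real crawler would fetch link_list[i] here

def run_workers(total, n_threads=5):
    counter = PageCounter(total)
    claimed = []
    threads = [threading.Thread(target=worker, args=(counter, claimed))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed

claimed = run_workers(300)
print(len(claimed), "pages claimed,", len(set(claimed)), "distinct")
```

Without the lock, two threads could both pass the bounds check before either increments the counter, which leads to duplicated pages or an index one past the end.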

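Multithreaded crawling can also be combined with a queue so that every thread keeps pulling work until none is left, which is a little faster. The complete code is as follows: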

#coding:utf-8
import threading
import time
import re
import requests
import queue as Queue
link_list=[]
with open('a.txt','r') as f:
   file_list=f.readlines()
   for each in file_list:
      each=re.sub('\n','',each)
      link_list.append(each)
class MyThread(threading.Thread):
   def __init__(self,name,q):
      threading.Thread.__init__(self)
      self.name=name
      self.q=q
   def run(self):
      print("%s is start "%self.name)
      crawel(self.name,self.q)
      print("%s is end "%self.name)
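def crawel(threadname,q):
   while not q.empty():
      try:
         # another thread may take the last URL between the empty() check
         # and this call, so q.get() stays inside the try block
         temp_url=q.get(timeout=1)
         r=requests.get(temp_url,timeout=20)
         print(threadname,r.status_code,temp_url)
      except Queue.Empty:
         break
      except Exception as e:
         print("Error",e)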
if __name__=='__main__':
   start=time.time()
   thread_list=[]
   thread_Name=['Thread-1','Thread-2','Thread-3','Thread-4','Thread-5']
   workQueue=Queue.Queue(1000)
   # fill the queue
   for url in link_list:
      workQueue.put(url)
   # create the threads
   for tname in thread_Name:
      thread=MyThread(tname,workQueue)
      thread.start()
      thread_list.append(thread)
   for t in thread_list:
      t.join()
   end=time.time()
   print("All using time:",end-start)
   print("Exiting Main Thread")

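Crawling with a queue requires the queue library. Besides threads, we also need to know how to work with queues. The code above relies on three queue operations: creating and filling the queue, passing the queue to the threads, and consuming the queue until it is empty. They are shown below:

1. Creating and filling the queue: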

workQueue=Queue.Queue(1000)
# fill the queue
for url in link_list:
   workQueue.put(url)

2. Passing the queue to the threads:

thread=MyThread(tname,workQueue)

3. Consuming the queue until it is empty:

def crawel(threadname,q):
   while not q.empty():
      pass

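The idea of a queue is first in, first out: URLs are taken in the order they were added, and the threads finish once the queue is empty.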
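
The two properties just described, first in first out and stopping when the queue is empty, can be checked without any network access. This is a minimal sketch under stated assumptions: drain and run_queue_demo are hypothetical names, the "pg1" to "pg20" strings stand in for URLs, and get_nowait() is used here in place of an empty() check followed by get().

```python
import queue
import threading

def drain(name, q, seen):
    """Take URLs until the queue is empty, recording which thread handled each."""
    while True:
        try:
            url = q.get_nowait()       # raises queue.Empty when nothing is left
        except queue.Empty:
            break
        seen.append((name, url))       # a real crawler would request the URL here

def run_queue_demo(urls, n_threads=3):
    q = queue.Queue()
    for url in urls:
        q.put(url)
    seen = []
    threads = [threading.Thread(target=drain,
                                args=("Thread-" + str(k + 1), q, seen))
               for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

# With a single consumer, items come out in the order they went in.
fifo = queue.Queue()
for item in ["a", "b", "c"]:
    fifo.put(item)
order = [fifo.get() for _ in range(3)]
print(order)

seen = run_queue_demo(["pg" + str(i) for i in range(1, 21)])
print(len(seen), "URLs handled")
```

With several consumers the queue still hands items out in insertion order, but the order in which the threads finish processing them is not guaranteed.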

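Multiprocess crawlers: generally there are two ways to build one, multiprocessing on its own and Pool + Queue.

multiprocessing is used in much the same way as threading; only part of the code needs to be replaced, namely how a process is defined and initialized and how it is terminated.

1. Defining and initializing a process: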

from multiprocessing import Process
class Myprocess(Process):
    def __init__(self):
        Process.__init__(self)

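2. Terminating child processes along with the parent: when daemon is set to True before the process is started, the child process is terminated automatically once the parent process exits.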

p.daemon=True
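
Putting the two pieces together, a Process subclass with the daemon flag might look like the sketch below. DemoProcess and its arguments are hypothetical, and summing numbers stands in for crawling. The code that starts processes sits under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start child processes by re-importing the main module.

```python
from multiprocessing import Process, Queue

class DemoProcess(Process):
    def __init__(self, name, numbers, out_q):
        Process.__init__(self, name=name)
        self.numbers = numbers
        self.out_q = out_q
        self.daemon = True     # must be set before start() is called

    def run(self):
        # runs in the child process once start() is called
        self.out_q.put((self.name, sum(self.numbers)))

if __name__ == "__main__":
    out_q = Queue()
    procs = [DemoProcess("process" + str(i), range(i * 10), out_q)
             for i in (1, 2)]
    for p in procs:
        p.start()
    # read the results before joining so the children can flush their data
    results = dict(out_q.get() for _ in procs)
    for p in procs:
        p.join()
    print(results)
```

Since join() is still called here, the daemon flag only matters if the parent exits early; in that case the children are terminated instead of being left running.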

The other approach combines Manager with Pool:

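from multiprocessing import Manager, Pool
manager=Manager()
workQueue=manager.Queue(1000)  # a queue proxy that can be shared with the pool workers
for url in link_list:
    workQueue.put(url)
pool=Pool(processes=3)
for i in range(1,5):
    # crawler(q, index) is the crawl function run by the pool workers
    pool.apply_async(crawler,args=(workQueue,i))
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to finish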
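
For reference, here is a self-contained sketch of the Manager + Pool pattern that runs without network access. The body of crawler is an assumption (the post does not show it): it simply drains the queue and returns what it handled, and the "pg1" to "pg30" strings are placeholders for URLs.

```python
import queue
from multiprocessing import Manager, Pool

def crawler(q, worker_id):
    """Drain the shared queue and return the items this worker handled."""
    handled = []
    while True:
        try:
            url = q.get_nowait()   # raises queue.Empty when nothing is left
        except queue.Empty:
            break
        handled.append(url)        # a real crawler would request the URL here
    return worker_id, handled

if __name__ == "__main__":
    urls = ["pg" + str(i) for i in range(1, 31)]
    with Manager() as manager:
        work_queue = manager.Queue(1000)
        for url in urls:
            work_queue.put(url)
        pool = Pool(processes=3)
        async_results = [pool.apply_async(crawler, args=(work_queue, i))
                         for i in range(1, 4)]
        pool.close()               # no more tasks will be submitted
        pool.join()                # wait for the workers to finish
        total = sum(len(r.get()[1]) for r in async_results)
    print(total, "URLs handled")
```

Calling get() on each result returned by apply_async is also the way to surface exceptions raised inside a worker, which are otherwise silently dropped.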
