Python Web Scraping from Scratch (2): The urllib Library - CodingNote.cc

October 5, 2019
Notes

Picking up where the previous post left off, this installment covers the urllib library.

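1. What is urllib?

urllib is Python's built-in HTTP request library. It is made up of four modules:

    urllib.request      sending requests
    urllib.error        exception handling
    urllib.parse        URL parsing
    urllib.robotparser  robots.txt parsing

It ships with Python, so nothing extra needs to be installed.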

Note: the library is used differently in Python 2 and Python 3.

Python 2

    import urllib2
    response = urllib2.urlopen('http://baidu.com')

Python 3

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')

2. Methods and modules

1) request

Basic usage (a GET request):

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))

Calling urllib.request.urlopen('http://www.baidu.com') and reading the response gives us a long block of text, the HTML source of the page. This is the text we will later extract the data we need from.

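With request parameters (a POST request):

    import urllib.parse
    import urllib.request

    # urlencode() returns a str; the request body must be bytes
    data = bytes(urllib.parse.urlencode({'username': 'cainiao'}), encoding='utf8')
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read())

The response shows that the username parameter we put in data was passed to the server.

Note: data must be of type bytes.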
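To make the bytes requirement concrete, here is a minimal sketch that builds the same form body without sending anything over the network. It also shows that attaching a body to a Request makes its default method POST. The variable names are illustrative and not taken from the original post.

```python
from urllib import parse, request

form = {'username': 'cainiao'}

# urlencode() produces a str such as 'username=cainiao'
encoded = parse.urlencode(form)

# The request body has to be bytes, so encode the str
data = encoded.encode('utf-8')

# Build the request without sending it; a body makes the default method POST
req = request.Request('http://httpbin.org/post', data=data)

print(type(encoded).__name__, type(data).__name__)
print(req.get_method())
```

Calling encoded.encode('utf-8') is equivalent to the bytes(..., encoding='utf8') form used above; either spelling works.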

Setting a request timeout:

    import urllib.request

    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
    print(response.read())

If the server does not answer within the timeout, the call fails with a "timed out" error. We can catch it with the urllib.error module:

    import urllib.error
    import urllib.request

    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
        print(response.read())
    except urllib.error.URLError as e:
        # The error type is not checked here; inspect e.reason first
        # if you need to tell timeouts apart from other failures.
        print('The connection timed out!')
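The except clause above treats every URLError as a timeout. As a rough sketch of the error-type check it mentions, the helper below (is_timeout is a hypothetical name, not part of urllib) looks at the exception and at URLError.reason. The exceptions are constructed by hand so the check can be tried offline.

```python
import socket
from urllib import error


def is_timeout(exc):
    """Return True if exc looks like a timeout raised during a request."""
    # A timeout while reading the response can surface as a bare socket.timeout
    if isinstance(exc, socket.timeout):
        return True
    # A timeout while connecting is usually wrapped in URLError.reason
    return isinstance(exc, error.URLError) and isinstance(exc.reason, socket.timeout)


# Hand-made exceptions standing in for real network failures
timed_out = error.URLError(socket.timeout('timed out'))
refused = error.URLError('connection refused')

print(is_timeout(timed_out))
print(is_timeout(refused))
```

In a real scraper the same check would sit inside the except block, for example to retry on timeouts but give up on other errors.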

That covers the basics of urlopen. If you want to go further, read its source code to understand the function in more depth.

2) The response

The type of the response object:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(type(response))
    # prints <class 'http.client.HTTPResponse'>

Status code and headers:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(type(response))                # response type
    print(response.status)               # status code, covered in the previous post
    print(response.getheaders())         # all response headers
    print(response.getheader('Server'))  # a single response header

The response body:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))  # response body

The response body arrives as bytes, so we call decode('utf-8') to turn it into text.
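Not every page is UTF-8, so one option is to decode with the charset the server declares. The sketch below is an assumption-laden demo rather than part of the original post: it starts a tiny local server (HelloHandler is a made-up stand-in) so it can run without internet access, then reads the status, a header, and the decoded body.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the example needs no internet access."""

    def do_GET(self):
        body = '你好, urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# An empty ProxyHandler keeps environment proxy settings out of the local demo
opener = request.build_opener(request.ProxyHandler({}))

with opener.open(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader('Content-Type')
    # Prefer the charset the server declares, fall back to UTF-8
    charset = response.headers.get_content_charset() or 'utf-8'
    raw = response.read()       # bytes
    text = raw.decode(charset)  # str

server.shutdown()
server.server_close()
print(status, content_type, charset, text)
```

Using the response in a with statement closes the connection as soon as the block ends.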

A common way to write a POST request:

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
        'Host': 'httpbin.org',
    }
    form = {
        'name': 'cxiaocai',
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

It can also be written like this, adding the header after the Request has been created:

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    form = {
        'name': 'cxiaocai',
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    req = request.Request(url=url, data=data, method='POST')
    # add_header takes the header name and value as two arguments
    req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
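A Request object can be inspected before it is sent, which helps when a server rejects a request and you want to see what would go out. The sketch below builds the same POST request and prints its parts without touching the network. One detail to be aware of: Request stores header names capitalized, so the header added as 'User-Agent' is looked up as 'User-agent'.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
data = parse.urlencode({'name': 'cxiaocai'}).encode('utf-8')

req = request.Request(url, data=data, method='POST')
# add_header takes the name and the value as two separate arguments
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

print(req.get_method())
print(req.full_url)
print(req.data)
# Header names are stored capitalized, so look it up as 'User-agent'
print(req.get_header('User-agent'))
```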

With this, the most basic urllib requests are covered, and a large share of websites can already be scraped.

3. Proxy settings

We will only touch on proxies briefly here; a later post will demonstrate them with a real scraper.

Proxy handler

    import urllib.request

    proxy_handler = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:1111',
        'https': 'https://127.0.0.1:2222',
    })
    opener = urllib.request.build_opener(proxy_handler)
    response = opener.open('http://www.baidu.com')
    print(response.read())  # I have no proxy available, so this code is untested.
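Since the proxy addresses above are placeholders, the following sketch only builds openers and does not send a request. make_opener is a hypothetical helper name. It relies on two documented behaviors: ProxyHandler accepts a mapping from scheme to proxy URL, and an empty mapping disables proxies picked up from the environment.

```python
from urllib import request


def make_opener(proxies):
    """Build an opener that routes requests through the given proxies.

    proxies maps a scheme to a proxy URL; an empty dict disables proxies.
    """
    return request.build_opener(request.ProxyHandler(proxies))


# Placeholder addresses from the example above; nothing is assumed to listen there
proxied = make_opener({
    'http': 'http://127.0.0.1:1111',
    'https': 'https://127.0.0.1:2222',
})
direct = make_opener({})

# install_opener() would make plain urlopen() calls use the chosen opener:
# request.install_opener(proxied)

uses_proxy = any(isinstance(h, request.ProxyHandler) for h in proxied.handlers)
print(uses_proxy)
```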

Cookie settings

    import http.cookiejar
    import urllib.request

    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')

    # Print every cookie the server set
    for item in cookie:
        print(item.name + "=" + item.value)

Some websites require a login, which is why we need to handle cookies here.

We can also save the cookies to a text file so they can be read back later.

    import http.cookiejar
    import urllib.request

    filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    # Keep session cookies and expired cookies in the file as well
    cookie.save(ignore_discard=True, ignore_expires=True)

After the code runs, a cookie.txt file appears in the project directory, written in the Mozilla cookies.txt format.

Another cookie storage format:

    import http.cookiejar
    import urllib.request

    filename = 'cookie.txt'
    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)

Running this code also produces a text file, this time in the libwww-perl (Set-Cookie3) format.

Next, we read back the cookie file we just saved:

    import http.cookiejar
    import urllib.request

    # Load with the same jar class that wrote the file (LWP format here)
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookie.txt', ignore_expires=True, ignore_discard=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
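The save and load steps can be tried without contacting a website. The sketch below is an illustration under stated assumptions: the cookie name, value, and domain are invented, make_cookie is a hypothetical helper, and the file is written to a temporary directory. It also shows why ignore_discard matters: a session cookie (one with no expiry) is skipped by save() unless that flag is set.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (made-up values for the demo)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'cookie.txt')

jar = LWPCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Read the file back into a fresh jar
loaded = LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
pairs = {c.name: c.value for c in loaded}
print(pairs)

# With the default flags the session cookie is not written at all
strict_path = os.path.join(workdir, 'strict.txt')
jar.save(strict_path)
strict = LWPCookieJar()
strict.load(strict_path)
print(len(strict))
```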

4. Exception handling

A simple example: here we request a page that does not exist.

    from urllib import request, error

    try:
        response = request.urlopen('https://www.cnblogs.com/cxiaocai/articles/index123.html')
    except error.URLError as e:
        print(e.reason)
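urllib.error also defines HTTPError, a subclass of URLError that carries the HTTP status code. A possible refinement, sketched here with a hypothetical describe helper, is to test for HTTPError first and fall back to URLError. The exceptions are built by hand (with a made-up example.com URL) so the sketch runs offline.

```python
import io
from urllib import error


def describe(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it must be checked first
    if isinstance(exc, error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'unexpected error: %r' % (exc,)


# Hand-made exceptions standing in for real failures
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', None, io.BytesIO())
unreachable = error.URLError('connection refused')

print(describe(not_found))
print(describe(unreachable))
```

In a try block the same order applies: put except error.HTTPError before except error.URLError.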

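The index123.html page requested above does not exist, so the request raises an error. Catching the exception lets the program keep running, and we can retry the request if we want. See also the official documentation: https://docs.python.org/3/library/urllib.error.html#module-urllib.error

5. URL parsing

urlparse

urlparse is the main function for parsing URLs. Let's start with a simple example:

    from urllib.parse import urlparse

    result = urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1')
    print(type(result), result)

The function splits the URL into its parts and returns a ParseResult with six fields: scheme, netloc, path, params, query and fragment.

We can also specify the scheme, http or https:

    from urllib.parse import urlparse

    result = urlparse('www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1', scheme='https')
    print(result)

In the output the scheme is set to https. The scheme argument is only a default: it is used when the URL itself carries no scheme. Because this string has no leading //, urlparse treats www.baidu.com/s as the path and leaves netloc empty.

The result also contains a fragment field. If we want the fragment to stay attached to the query instead of being split off, we can do this:

    from urllib.parse import urlparse

    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
    print(result)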
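To see what allow_fragments changes, this short sketch parses the same URL with and without the flag, plus a variant of the URL with no query string (that variant is my own addition for illustration).

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

default = urlparse(url)
kept = urlparse(url, allow_fragments=False)
no_query = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

# Default: the fragment is split off into its own field
print(default.params, default.query, default.fragment)
# allow_fragments=False: '#comment' stays on the query, fragment is empty
print(kept.query, repr(kept.fragment))
# No query to attach to: '#comment' stays on the path
print(no_query.path)
```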

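In the output, fragment is empty and #comment stays attached to the query. If the URL has no query, the fragment text is attached to the path instead.

Now we know how to split a URL. Going the other way, if we have the parts of a URL as a sequence, for example data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment'], we can assemble them with urlunparse.

There is also urljoin, which joins URLs. The second URL takes precedence: whatever components it has are kept, and whatever it lacks is taken from the first URL.

If we have a dict of parameters and a URL and want to send a GET request (passing parameters in a GET request was covered in the previous post), we can encode the dict, for instance with urlencode, and append it to the URL. Note that we have to add the '?' after the URL ourselves.

Finally, there is urllib.robotparser, which is mainly used to parse robots.txt files. The official documentation has some examples; since this module is rarely used, I will not go into detail here. Documentation: https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser Interested readers can read it on their own.

That covers the basic usage of urllib, and you can now try writing some scrapers yourself (parse the pages with regular expressions for now; simpler methods will come later). To study urllib in more depth, read the official documentation and its source code: https://docs.python.org/3/library/urllib.html

Note: many readers copy my code directly and find that it raises errors when pasted, so they have to delete the extra blank lines themselves. I do not recommend copying and pasting; later we will set up a GitHub repository for everyone to use directly.

The next post will cover the Requests package, which I personally find easier to use than urllib. Stay tuned.

Thank you for reading. If anything here is incorrect, corrections are welcome. Thank you.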
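As a companion to the urlunparse, urljoin and query-string discussion in section 5, here is a minimal offline sketch of all three. The data list comes from the text; the URLs passed to urljoin and the params dict are made-up illustrations.

```python
from urllib.parse import urlencode, urljoin, urlunparse

# urlunparse: six parts in the order scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
rebuilt = urlunparse(data)
print(rebuilt)

# urljoin: the second URL wins; missing parts come from the first (base) URL
relative = urljoin('http://www.baidu.com', 'FAQ.html')
absolute = urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html')
query_only = urljoin('http://www.baidu.com/about.html?wd=abc', '?category=2')
print(relative)
print(absolute)
print(query_only)

# urlencode: turn a dict into a query string; the '?' is added by hand
params = {'name': 'cxiaocai', 'age': 30}
get_url = 'http://httpbin.org/get?' + urlencode(params)
print(get_url)
```

The resulting get_url can be passed straight to urllib.request.urlopen to send the GET request.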
