
Python crawlers: handling Chinese characters in URLs

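When learning to write crawlers in Python, we often run into URLs that contain Chinese. To request such a URL, the Chinese text has to be encoded and spliced into a URL that servers and browsers can recognize. Python already ships a function for this: urlencode. It takes a dict that pairs each query key with its (Chinese) value and turns it into the query string of our URL.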

In Python 2 the call is urllib.urlencode(keyword). In Python 3 it is urllib.parse.urlencode(keyword).
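The version difference above can be hidden behind a single import. The following is only a minimal sketch: build_search_url is a hypothetical helper name introduced for illustration, not something from the standard library, and nothing is sent over the network.

```python
# -*- coding: utf-8 -*-
# Import urlencode from wherever the running interpreter keeps it.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_search_url(base_url, params):
    """Hypothetical helper: append percent-encoded params to base_url."""
    return base_url + "?" + urlencode(params)


url = build_search_url("http://www.baidu.com/s", {"wd": "哈士奇"})
print(url)  # http://www.baidu.com/s?wd=%E5%93%88%E5%A3%AB%E5%A5%87
```

Each Chinese character becomes three %XX groups because urlencode percent-encodes its UTF-8 bytes.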

Let's look at the code. Python 2:

    # -*- coding: utf-8 -*-
    import urllib
    import urllib2

    # Suppose we want to search Baidu for the keyword 哈士奇 (husky).
    # The keyword is Chinese, so it has to be encoded first.
    keyword = {"wd": "哈士奇"}

    head_url = "http://www.baidu.com/s"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    # Percent-encode the dict into a query string and append it to the URL.
    wd = urllib.urlencode(keyword)
    url = head_url + "?" + wd

    req = urllib2.Request(url, headers=headers)

    response = urllib2.urlopen(req)
    html = response.read()
    print(url)
    print(html.count('哈士奇'))

Running it prints the encoded URL, followed by the number of times 哈士奇 appears in the returned page.

In Python 3:

    # -*- coding: utf-8 -*-
    # File  : url中出現的中文問題.py
    # Author: HuXianyong
    # Date  : 2018-09-13 17:39
    from urllib import request
    import urllib.parse

    # Suppose we want to search Baidu for the keyword 哈士奇 (husky).
    # The keyword is Chinese, so it has to be encoded first.
    keyword = {"wd": "哈士奇"}

    head_url = "http://www.baidu.com/s"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    # Percent-encode the dict into a query string and append it to the URL.
    wd = urllib.parse.urlencode(keyword)
    url = head_url + "?" + wd

    req = request.Request(url, headers=headers)

    response = request.urlopen(req)
    html = response.read()

    # read() returns bytes in Python 3, so decode before counting.
    print(html.decode().count("哈士奇"))

    print(url)

Running it prints the number of times 哈士奇 appears in the decoded page, followed by the encoded URL.

To turn the encoded characters back into Chinese, use unquote.

For example, in Python 2:

    In [25]: dic = {"say":"你好!"}

    In [26]: urllib.urlencode(dic)
    Out[26]: 'say=%E4%BD%A0%E5%A5%BD%21'

    In [27]: aa  = urllib.urlencode(dic)

    In [28]: aa
    Out[28]: 'say=%E4%BD%A0%E5%A5%BD%21'

    In [29]: bb = urllib.unquote(aa)

    In [30]: bb
    Out[30]: 'say=\xe4\xbd\xa0\xe5\xa5\xbd!'

    In [31]: print(bb)
    say=你好!

Python 3:

    In [16]: dic = {"say":"你好!"}

    In [17]: aa = urllib.parse.urlencode(dic)

    In [18]: aa
    Out[18]: 'say=%E4%BD%A0%E5%A5%BD%21'

    In [19]: bb = urllib.parse.unquote(aa)

    In [20]: bb
    Out[20]: 'say=你好!'
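The round trip can also be written as a small Python 3 script. This is a sketch rather than part of the original session: the second dict (with a space in its value) is an assumed example, added because urlencode writes spaces as "+", which unquote leaves untouched while unquote_plus and parse_qs turn them back into spaces.

```python
from urllib.parse import parse_qs, unquote, unquote_plus, urlencode

# Same data as the interactive session above.
query = urlencode({"say": "你好!"})
print(query)            # say=%E4%BD%A0%E5%A5%BD%21
print(unquote(query))   # say=你好!

# parse_qs goes one step further and rebuilds a dict (values are lists).
params = parse_qs(query)
print(params)           # {'say': ['你好!']}

# Assumed extra example: a value containing a space.
spaced = urlencode({"wd": "哈士奇 puppy"})
print(spaced)                # wd=%E5%93%88%E5%A3%AB%E5%A5%87+puppy
print(unquote(spaced))       # wd=哈士奇+puppy
print(unquote_plus(spaced))  # wd=哈士奇 puppy
```

If the goal is to get the key/value pairs back rather than a readable string, parse_qs is usually the more convenient of the two.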

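For a POST request, however, the form data has to be passed in the data argument, and that data needs one more processing step. Otherwise the request fails with a type error: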

TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

The fix is to encode the urlencoded string to bytes: data = urllib.parse.urlencode(formData).encode(encoding="UTF8")

The code is as follows:

    # -*- coding: utf-8 -*-
    # File  : Ajax爬取豆瓣電影列表.py
    # Author: HuXianyong
    # Date  : 2018-09-14 14:35

    import urllib.parse
    from urllib import request

    url = "https://movie.douban.com/j/new_search_subjects?"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    formData = {
        "sort": "S",
        "range": "0,10",
        "tags": "電影,魔幻",
        "start": "0",
        "genres": "劇情"
    }

    # urlencode returns a str; POST data must be bytes, so encode it.
    data = urllib.parse.urlencode(formData).encode(encoding="UTF8")

    # Passing data makes urllib send the request as a POST.
    req = request.Request(url=url, data=data, headers=headers)

    response = request.urlopen(req)
    # Read the body only once; a second read() would return empty bytes.
    movie_info = response.read().decode()
    print(movie_info)
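The str-versus-bytes point behind the TypeError can be checked offline, without sending anything. The sketch below assumes a placeholder URL (https://example.com/search) and a trimmed-down form dict. It only builds the Request object and inspects it.

```python
from urllib import parse, request

# Trimmed-down form data, for illustration only.
form_data = {"tags": "電影,魔幻", "start": "0"}

encoded_str = parse.urlencode(form_data)  # str, not yet usable as POST data
data = encoded_str.encode("utf-8")        # bytes, what urlopen expects

# Placeholder URL (an assumption); no network request is made here.
req = request.Request("https://example.com/search", data=data)

print(type(encoded_str).__name__)  # str
print(type(data).__name__)         # bytes
print(req.get_method())            # POST
```

Because data is supplied, get_method() reports POST. A Request built without data would report GET.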