Web Scraping: HTTP Requests and HTML Parsing (Scraping Zhihu)

1. Sending Web Requests

1.1  requests 

  Use the get() method of the requests library to send a GET request. Two parameters are commonly added: a "user-agent" request header and a login "cookie".

1.1.1  user-agent

  Log in to the site and copy the "user-agent" value into a text file.

1.1.2  cookie

   Log in to the site and copy the "cookie" value into a text file.
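  Since 1.1.1 and 1.1.2 leave both values in text files, they can be read back when building the headers dict for requests. A minimal sketch, assuming one value per file and the hypothetical file names ua.txt and cookie.txt:

# Rebuild the request headers from the values saved in 1.1.1 and 1.1.2.
# ua.txt and cookie.txt are assumed names, one value per file.
def load_headers(ua_path='ua.txt', cookie_path='cookie.txt'):
    with open(ua_path, encoding='utf-8') as f:
        user_agent = f.read().strip()
    with open(cookie_path, encoding='utf-8') as f:
        cookie = f.read().strip()
    return {'user-agent': user_agent, 'cookie': cookie}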

1.1.3  Test code

import requests
from requests.exceptions import RequestException

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:   # this else branch is not strictly required
            return None
    except RequestException:
        print('Request failed')


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    get_page(input_url)


1.2  selenium

  Most sites can identify a Selenium-driven browser through the value of window.navigator.webdriver, so the first job of a Selenium scraper is to keep the site from detecting the automated browser. As with requests, a Selenium session also usually needs the "user-agent" request header and a login "cookie".
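  To check what a site sees, the flag can be read straight from the driver; a minimal sketch (an illustration of the detection signal, not code from the original post):

from selenium.webdriver import Chrome

# A plain Selenium session typically reports navigator.webdriver as true;
# after the tweaks in 1.2.1 it should come back undefined (None in Python).
driver = Chrome()
print(driver.execute_script('return window.navigator.webdriver'))
driver.quit()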

1.2.1  Removing the window.navigator.webdriver value in Selenium

  Add the following code to the program (this applies to older versions of Chrome):

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time


option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
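  On newer Chrome versions this switch alone often no longer hides the flag; a common extra step (my suggestion, not part of the original post) is to inject a script over the Chrome DevTools Protocol, continuing with the driver created above:

# Redefine navigator.webdriver before any page script runs.
# execute_cdp_cmd is available for Chrome drivers since Selenium 3.141.
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})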

1.2.2  user-agent

  Log in to the site, copy the "user-agent" value into a text file, and run the following code to add the request header:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


option = ChromeOptions()
option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')
driver = Chrome(options=option)  # create the driver after setting the option

1.2.3  cookie

   Selenium requires each cookie to be a dict with "name" and "value" keys and their corresponding values, so a cookie copied off the site as a single string does not meet that requirement. Instead, you can use Selenium's get_cookies() method to capture the login cookies:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import json

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com/signin?next=%2F')
time.sleep(30)  # log in manually during this pause
driver.get('https://www.zhihu.com/')
cookies = driver.get_cookies()
jsonCookies = json.dumps(cookies)

with open('cookies.txt', 'a') as f:  # choose your own file name and location
    f.write(jsonCookies)
    f.write('\n')
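  Alternatively, a raw cookie string copied from the browser's request headers can be converted into the name/value dicts Selenium expects; a minimal sketch (the helper name cookie_string_to_dicts is hypothetical):

def cookie_string_to_dicts(cookie_string):
    # Turn a raw "k1=v1; k2=v2" header string into Selenium-style dicts.
    cookies = []
    for pair in cookie_string.split('; '):
        name, _, value = pair.partition('=')
        cookies.append({'name': name, 'value': value})
    return cookies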

1.2.4  Test code example

  Copy the cookie obtained above into the program below and it will run.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies from this fresh session
time.sleep(2)

cookie = {}  # replace with your own cookie
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)   # print each hot-list question title
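  Rather than pasting a dict by hand, the cookie list saved to cookies.txt in 1.2.3 can be loaded back; a minimal sketch, assuming the driver is already on the zhihu.com domain as in the code above:

import json

# Read the cookie list written in 1.2.3 and attach each cookie to the session.
with open('cookies.txt') as f:
    for c in json.loads(f.readline()):
        driver.add_cookie(c)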

2. HTML Parsing (Locating Elements)

  To scrape the target data you must first locate the elements that hold it. Both BeautifulSoup and Selenium make it easy to traverse HTML elements.

2.1  Locating elements with BeautifulSoup

  In the code below, BeautifulSoup first locates the "h2" tags whose class is "HotItem-title", then reads the string value through the .text attribute.

import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:   # this else branch is not strictly required
            return None
    except RequestException:
        print('Request failed')

def parse_page(html):
    html = BeautifulSoup(html, "html.parser")
    titles = html.find_all("h2", {'class': 'HotItem-title'})[:10]
    for title in titles:
        print(title.text)   # .text is an attribute, not a method


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    html = get_page(input_url)
    if html:   # skip parsing when the request failed
        parse_page(html)


2.2  Locating elements with selenium

  Selenium's element-location syntax is somewhat different from the requests/BeautifulSoup approach. The code below (the same as the test code in 1.2.4) uses a hierarchical locator, 'div[itemprop="zhihu:question"] > a', which I find more dependable.

  In Selenium the text of an element is read through the .text attribute; note that BeautifulSoup tags likewise expose .text as an attribute, so neither is called as .text().

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies from this fresh session
time.sleep(2)

cookie = {}  # replace with your own cookie
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)   # print each hot-list question title
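  Note that find_elements_by_css_selector was removed in Selenium 4; on current versions the equivalent lookup goes through a By locator (a sketch of the replacement, not part of the original code):

from selenium.webdriver.common.by import By

# Selenium 4 replacement for find_elements_by_css_selector
for i in driver.find_elements(By.CSS_SELECTOR, 'div[itemprop="zhihu:question"] > a'):
    print(i.text)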
