Python Web Scraping Analysis Report

This was an assignment from my Python course and my first attempt at web scraping. I took plenty of detours and learned a lot along the way, so I am writing the process down here.

1. Fetching the XuetangX Partner Institution Pages

Requirements

Scrape the computer science course pages on XuetangX.
Save each course's name, teachers, school, and enrollment count to a CSV file.
Link: https://www.xuetangx.com/search?query=&org=&classify=1&type=&status=&page=1

1. Identifying the target

Opening the page and viewing its source shows none of the relevant content, so we can guess that the front end requests the actual data from the back end via AJAX. In the developer tools, I captured the following JSON data:

[Figure: JSON data captured in the developer tools]
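Judging from the fields extracted later in parse(), the captured JSON has roughly the following shape (a trimmed sketch; the values are filled in from one real course for illustration):

{
    "data": {
        "org_list": [
            {
                "name": "C++語言程式設計基礎",
                "org": {"name": "清華大學"},
                "count": 424718,
                "teacher": [{"name": "鄭莉"}, {"name": "李超"}, {"name": "徐明星"}]
            }
        ]
    }
}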

As the capture shows, this is exactly the JSON data we need. To work out how to fetch and extract it, I analyzed the browser's request and converted its cURL command into a Python request:

import requests

cookies = {
    'provider': 'xuetang',
    'django_language': 'zh',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh',
    'Content-Type': 'application/json',
    'django-language': 'zh',
    'xtbz': 'xt',
    'x-client': 'web',
    'Origin': 'https://www.xuetangx.com',
    'Connection': 'keep-alive',
    'Referer': 'https://www.xuetangx.com/search?query=&org=&classify=1&type=&status=&page=1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'TE': 'Trailers',
}

params = (
    ('page', '1'),
)

data = '{"query":"","chief_org":[],"classify":["1"],"selling_type":[],"status":[],"appid":10000}'

response = requests.post('https://www.xuetangx.com/api/v1/lms/get_product_list/', headers=headers, params=params, cookies=cookies, data=data)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.post('https://www.xuetangx.com/api/v1/lms/get_product_list/?page=1', headers=headers, cookies=cookies, data=data)

The site used for the conversion is https://curl.trillworks.com/. In the browser's developer tools, open the Network tab (Chrome or Firefox), right-click a request, choose Copy → Copy as cURL, then paste it into that site to convert it into a Python requests call.
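Before building the spider, it is worth sanity-checking the captured request. Below is a minimal sketch that reuses the cookies, headers, and data from the snippet above to walk the first five pages and confirm the JSON comes back as expected:

# Assumes the requests import plus cookies, headers and data from the snippet above.
for page in range(1, 6):
    resp = requests.post(
        'https://www.xuetangx.com/api/v1/lms/get_product_list/',
        params={'page': page},
        headers=headers,
        cookies=cookies,
        data=data,
    )
    resp.raise_for_status()
    # The course records sit under data -> org_list in the response JSON.
    courses = resp.json()['data']['org_list']
    print(page, len(courses))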

2. Designing the spider

The data to extract are the course name, teachers, school, and enrollment count. items.py is designed as follows:

# items.py
import scrapy


class XuetangItem(scrapy.Item):
    name = scrapy.Field()
    teachers = scrapy.Field()
    school = scrapy.Field()
    count = scrapy.Field()

Next comes the main event: designing spider.py. Since we are scraping JSON data rather than a static HTML page, we need to override the start_requests method to send the requests ourselves. Combining this with the Python request analyzed earlier, the code is as follows:

import scrapy
import json
from xuetang.items import XuetangItem


class mySpider(scrapy.spiders.Spider):
    name = "xuetang"
    allowed_domains = ["www.xuetangx.com/"]
    url = "url_pat = '//www.xuetangx.com/api/v1/lms/get_product_list/?page={}'"
    data = '{"query":"","chief_org":[],"classify":["1"],"selling_type":[],"status":[],"appid":10000}'
    # data由分析中得來
    headers = {
        'Host': 'www.xuetangx.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'authority': 'www.xuetangx.com',
        'Accept': 'application/json,text/plain,*/*',
        'Accept-Language': 'zh',
        'Accept-Encoding': 'gzip, deflate, br',
        'django-language': 'zh',
        'xtbz': 'xt',
        'content-type': 'application/json',
        'x-client': 'web',
        'Connection': 'keep-alive',
        'Referer': 'https://www.xuetangx.com/university/all',
        'Cookie': 'provider=xuetang; django_language=zh',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }
    # Copied straight from the browser, to keep the server from detecting a non-browser client and rejecting the request

    def start_requests(self):
        for page in range(1, 6):
            yield scrapy.FormRequest(
                url=self.url.format(page),
                headers=self.headers,
                method='POST',
                # The browser's request was a POST, and the response headers state that only POST is allowed
                body=self.data,
                callback=self.parse
            )

    def parse(self, response):
        j = json.loads(response.body)
        for each in j['data']['org_list']:
            item = XuetangItem()
            item['name'] = each['name']
            item['school'] = each['org']['name']
            item['count'] = each['count']
            teacher_list = []
            for teacher in each['teacher']:
                teacher_list.append(teacher['name'])
            # Some courses have several teachers, so collect them all and join them into one record
            item['teachers'] = ','.join(teacher_list)
            yield item

Next, design pipelines.py to save the scraped data to a CSV file:

import csv


class XuetangPipeline(object):

    def open_spider(self, spider):
        try:
            self.file = open('data.csv', "w", encoding="utf-8", newline='')
            self.csv = csv.writer(self.file)
        except Exception as err:
            print(err)

    def process_item(self, item, spider):
        self.csv.writerow(list(item.values()))
        return item

    def close_spider(self, spider):
        self.file.close()

With this, the spider is ready to run, although ITEM_PIPELINES still has to be configured in settings.py, as sketched below.
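A minimal entry might look like this (a sketch; the dotted path assumes the Scrapy project is named xuetang, matching the imports used above):

# settings.py (sketch)
ITEM_PIPELINES = {
    # The number is the pipeline's order; lower values run first.
    'xuetang.pipelines.XuetangPipeline': 300,
}

After that, the spider can be started from the command line, or by running a Python file that executes the shell command: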

from scrapy import cmdline
cmdline.execute("scrapy crawl xuetang".split())

3. Displaying the data

The saved CSV file looks like the following; it contains exactly 50 records, only the first few of which are shown here:

C++語言程式設計基礎,清華大學,424718,"鄭莉,李超,徐明星"
數據結構(上),清華大學,411298,鄧俊輝
數據結構(下),清華大學,358804,鄧俊輝
……

2. Fetching Lianjia Second-Hand Housing Listings

Requirements:

Scrape second-hand housing data from the Lianjia website: https://bj.lianjia.com/ershoufang/
Scrape data for four Beijing districts: Dongcheng, Xicheng, Haidian, and Chaoyang (5 pages per district), and save each listing's name, total price, floor area, and unit price to a JSON file.

1. Identifying the target

Opening the page and viewing its source, we can see that the listing information is already embedded in the HTML, which means the page is rendered on the back end before being returned to the browser. We can therefore extract the content with XPath. Let's analyze the structure of one listing:

<html>
 <head></head>
 <body>
  <a class="noresultRecommend img LOGCLICKDATA" href="//bj.lianjia.com/ershoufang/101109392759.html" target="_blank" data-log_index="1" data-el="ershoufang" data-housecode="101109392759" data-is_focus="" data-sl="">
   <!-- 熱推標籤、埋點 -->
   <img src="//s1.ljcdn.com/feroot/pc/asset/img/vr/vrgold.png?_v=202011171709034" class="vr_item" /><img class="lj-lazy" src="//image1.ljcdn.com/110000-inspection/pc1_hAjksKeSW_1.jpg.296x216.jpg" data-original="//image1.ljcdn.com/110000-inspection/pc1_hAjksKeSW_1.jpg.296x216.jpg" alt="北京西城長椿街" style="display: block;" /></a>
  <div class="info clear">
   <div class="title">
    <a class="" href="//bj.lianjia.com/ershoufang/101109392759.html" target="_blank" data-log_index="1" data-el="ershoufang" data-housecode="101109392759" data-is_focus="" data-sl="">槐柏樹街南里 南北通透兩居室 精裝修</a>
    <!-- 拆分標籤 只留一個優先順序最高的標籤-->
    <span class="goodhouse_tag tagBlock">必看好房</span>
   </div>
   <div class="flood">
    <div class="positionInfo">
     <span class="positionIcon"></span>
     <a href="//bj.lianjia.com/xiaoqu/1111027374889/" target="_blank" data-log_index="1" data-el="region">槐柏樹街南里 </a> - 
     <a href="//bj.lianjia.com/ershoufang/changchunjie/" target="_blank">長椿街</a> 
    </div>
   </div>
   <div class="address">
    <div class="houseInfo">
     <span class="houseIcon"></span>2室1廳 | 60.81平米 | 南 北 | 精裝 | 中樓層(共6層) | 1991年建 | 板樓
    </div>
   </div>
   <div class="followInfo">
    <span class="starIcon"></span>226人關注 / 1個月以前發布
   </div>
   <div class="tag">
    <span class="subway">近地鐵</span>
    <span class="isVrFutureHome">VR看裝修</span>
    <span class="five">房本滿兩年</span>
    <span class="haskey">隨時看房</span>
   </div>
   <div class="priceInfo">
    <div class="totalPrice">
     <span>600</span>萬
    </div>
    <div class="unitPrice" data-hid="101109392759" data-rid="1111027374889" data-price="98668">
     <span>單價98668元/平米</span>
    </div>
   </div>
  </div>
  <div class="listButtonContainer">
   <div class="btn-follow followBtn" data-hid="101109392759">
    <span class="follow-text">關注</span>
   </div>
   <div class="compareBtn LOGCLICK" data-hid="101109392759" log-mod="101109392759" data-log_evtid="10230">
    加入對比
   </div>
  </div>
 </body>
</html>

We can see that the listing name is inside the a tag under the div with class="title"; the floor area is stored in the div with class="houseInfo", though the string needs to be split; and both the unit price and the total price live in the div with class="priceInfo". Interestingly, some listings display no unit price, i.e. the span element is empty, but the parent div carries a data-price attribute whose value is exactly the unit price, so we can extract that instead.
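As a quick illustration of that fallback, here is a minimal sketch that runs scrapy's Selector over a trimmed unitPrice fragment from the listing above:

from scrapy.selector import Selector

# A trimmed unitPrice fragment from the listing shown above, with an empty span.
html = '<div class="unitPrice" data-hid="101109392759" data-rid="1111027374889" data-price="98668"><span></span></div>'

sel = Selector(text=html)
# Read the data-price attribute rather than the span text,
# since the span can be empty on some listings.
price = sel.xpath('//div[@class="unitPrice"]/@data-price').get()
print(price)  # -> 98668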

2. Designing the spider

The data to save are the listing name, floor area, total price, and unit price. items.py is as follows:

import scrapy


class YijiaItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    square = scrapy.Field()
    price = scrapy.Field()
    total = scrapy.Field()

Analyzing the pages to scrape: the site provides a district filter, and after clicking "Xicheng District" the URL becomes https://bj.lianjia.com/ershoufang/xicheng/, so the variable parts of the URL can be filled in with format. spider.py is as follows:

from yijia.items import YijiaItem
import scrapy


class mySpider(scrapy.spiders.Spider):
    name = 'lianjia'
    allowed_domains = ["bj.lianjia.com/"]
    url = "//bj.lianjia.com/ershoufang/{}/pg{}/"
    # 第一個地方為地區,第二個為頁數
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
    }
    # headers copied from the browser

    def start_requests(self):
        positions = ["dongceng", "xicheng", "chaoyang", "haidian"]
        for position in positions:
            for page in range(1, 6):
                yield scrapy.FormRequest(
                    url=self.url.format(position, page),
                    method="GET",
                    headers=self.headers,
                    callback=self.parse
                )

    def parse(self, response):
        for each in response.xpath("/html/body/div[4]/div[1]/ul/li"):
            # each <li> is one listing card
            item = YijiaItem()
            item['name'] = each.xpath("div[1]/div[1]/a/text()").extract()[0]
            # houseInfo text looks like "2室1廳 | 60.81平米 | 南 北 | ..."; the second field is the floor area
            house_info = each.xpath("div[1]/div[3]/div[1]/text()").extract()[0].split('|')
            item['square'] = house_info[1].strip()
            item['total'] = each.xpath("div[1]/div[6]/div[1]/span/text()").extract()[0] + "萬元"
            # read the data-price attribute, since the unit-price span can be empty (see the analysis above)
            item['price'] = each.xpath("div[1]/div[6]/div[2]/@data-price").extract()[0] + "元/平米"
            yield item

Next, design the pipeline file to save the content to a JSON file:

import json


class YijiaPipeline(object):
    dict_data = {'data': []}

    def open_spider(self, spider):
        try:
            self.file = open('data.json', "w", encoding="utf-8")
        except Exception as err:
            print(err)

    def process_item(self, item, spider):
        dict_item = dict(item)
        self.dict_data['data'].append(dict_item)
        return item

    def close_spider(self, spider):
        self.file.write(json.dumps(self.dict_data, ensure_ascii=False, indent=4, separators=(',', ':')))
        self.file.close()

Finally, run the spider the same way as in the previous example; a runner script is sketched below.
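A runner mirroring the earlier one (a sketch; the spider name lianjia comes from spider.py, and ITEM_PIPELINES must again be set in settings.py, this time pointing at YijiaPipeline):

from scrapy import cmdline
cmdline.execute("scrapy crawl lianjia".split())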

3. Displaying the data

The saved JSON file is shown below; only the first few records are included here:

{
    "data":[
        {
            "name":"此房南北通透格局,採光視野無遮擋,交通便利",
            "square":"106.5平米",
            "total":"1136萬元",
            "price":"106667元/平米"
        },
        {
            "name":"新安南里 南北通透 2層本房滿五年唯一",
            "square":"55.08平米",
            "total":"565萬元",
            "price":"102579元/平米"
        }
        /* remaining records omitted */
    ]
}