Scraping Second-Hand Home Prices and Listing Information

December 24, 2019
Notes

This article is a reader submission. Author: 董匯標MINUS. Zhihu: https://zhuanlan.zhihu.com/p/97235643

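One day my friends and I were chatting in our group about buying a home. None of us has been working for more than a few years, so our wallets are still on the light side.

That got me thinking: I could pull the data from a certain housing site and see where prices actually stand.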

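Preparation

The site lists new homes, second-hand homes, rentals, and so on. For a buyer, especially one buying a first home in Beijing, a second-hand home is likely to be one of the options, so I decided to focus on second-hand listings.

There are plenty of datasets and tutorials online already, but I still chose to scrape everything again myself: first, to make sure the data is current; second, to get some practice instead of being lazy.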

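Skills required: BeautifulSoup to parse the pages, regular expressions to extract the data, csv to store it

Crawler approach: the usual approach for scraping an ordinary website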

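Each home on the site's second-hand listing page (shown in a screenshot in the original post) carries the information I want to extract: location, number of bedrooms and living rooms, floor area in square meters, orientation, decoration style, floor, year built, building type, and asking price.

Then, by inspecting the HTML, I located the element that holds each of these fields (scraping tutorials cover this step at length, so I won't repeat it here).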
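To make the field mapping a little more concrete, here is a minimal sketch that runs the same kind of regular expressions over a hand-written fragment. The fragment is an assumption modelled on the class names used later in this post (`title`, `positionInfo`, `houseInfo`, `totalPrice`); the real page markup may well differ, and every value in it is made up. `parse_listing` is a hypothetical helper name, not something from the original code.

```python
import re

# Hand-written sample, NOT copied from the real site. All values are invented.
SAMPLE = (
    '<div class="info clear">'
    '<div class="title"><a href="https://example.com/ershoufang/1.html" '
    'target="_blank">Bright south-facing flat</a></div>'
    '<div class="positionInfo"><a href="#" target="_blank">Sample Garden</a> - '
    '<a href="#" target="_blank">Sample District</a></div>'
    '<div class="houseInfo"><span class="icon"></span>'
    '2室1廳 | 89.5平米 | 南 | 精裝 | 中樓層 | 2008年建 | 板樓</div>'
    '<div class="totalPrice"><span>560</span>萬</div>'
    '</div>'
)


def parse_listing(html):
    """Return one listing as a flat list of fields, or None if a part is missing."""
    position = re.search(
        r'class="positionInfo">.*?_blank">(.*?)</a>.*?_blank">(.*?)</a>', html, re.S)
    info = re.search(r'class="houseInfo">.*?</span>(.*?)</div>', html, re.S)
    price = re.search(r'class="totalPrice"><span>(.*?)</span>(.*?)</div>', html, re.S)
    title = re.search(r'class="title"><a href="(.*?)"[^>]*>(.*?)</a>', html, re.S)
    if not (position and info and price and title):
        return None
    # The house details arrive as one string separated by '|'.
    info_parts = [part.strip() for part in info.group(1).split('|')]
    return [*position.groups(), *info_parts, *price.groups(), *title.groups()]


if __name__ == '__main__':
    print(parse_listing(SAMPLE))
```

Non-greedy groups (`.*?`) keep each match from running past its own closing tag, and returning `None` lets the caller skip a listing whose markup does not fit instead of crashing on it.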

from bs4 import BeautifulSoup
import re
import csv
import requests
import pandas as pd
from random import choice
import time

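Lianjia shows 100 pages of second-hand listings in total, so the first step is clear: create a new CSV file, give it a name, and set up its columns.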

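# Columns: community, area, rooms, square meters, orientation, decoration,
# floor, year built, building type, price, price unit, URL, recommendation blurb
columns = ['小區', '地區', '廳', '平米數', '方向', '狀態', '層', 'build-year', '形式', '錢', '單位', '網址', '推薦語']

# Skip this step if the file already has a header row
with open('鏈家二手房100頁.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(columns)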
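The "skip this if the header is already there" rule can also be checked in code rather than by hand. The following is only a sketch of one way to do it; `ensure_header` is a hypothetical helper name, and it treats any non-empty file as already having a header, which is an assumption.

```python
import csv
import os


def ensure_header(path, columns):
    """Write the header row only when the file is missing or empty.

    Returns True if a header was written, False if the file already had content.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    with open(path, 'w', newline='', encoding='utf-8') as file:
        csv.writer(file).writerow(columns)
    return True
```

Calling it before every run means a restarted crawl keeps appending to the same file without wiping the rows collected so far.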

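Scraping the Data

1. There are 100 pages, so write a loop to fetch them one by one

2. Parse each page with BeautifulSoup

3. Use regular expressions to pull out each field

4. Write the extracted information to the CSV file created above

5. Print a message at the end of each page to keep track of progress

6. To avoid "putting pressure on Lianjia's servers", wait a few seconds before moving on to the next page

7. When everything has been scraped, print fin~ (a personal habit of mine)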

# base_url1 and base_url2 are not defined in the original post. They are the parts
# of the listing-page URL before and after the page number, and must be set
# before this loop runs.
for page in range(1, 101):
    urll = base_url1 + str(page) + base_url2
    print(urll)
    get_page = requests.get(urll)
    bs_page = BeautifulSoup(get_page.text, 'html.parser')
    list_house = bs_page.find_all('div', class_='info clear')
    for house_one in list_house:
        house_info = house_one.find_all('div', class_='houseInfo')
        position_info = house_one.find_all('div', class_='positionInfo')
        totalPrice = house_one.find_all('div', class_='totalPrice')
        href = house_one.find_all('div', class_='title')

        # Regex extraction
        # Community name and area
        position_str = re.findall(r'_blank">(.+)</a.+_blank">(.+)?</a', str(position_info))
        position_str1 = list(position_str[0])

        # House details, separated by '|'
        house_info_str = re.findall(r'span>(.+)?</div>', str(house_info))
        house_info_str = str(house_info_str)[2:-2].split('|')

        # Total price and its unit
        totalPrice_str = re.findall(r'<span>(.+)</span>(.+)</div>', str(totalPrice))
        totalPrice_str = list(totalPrice_str[0])

        # Detail-page URL
        href_str = re.findall(r'http.+html', str(href))

        # Recommendation blurb (the link text of the title)
        AD_str = re.findall(r'_blank">(.+)?</a>', str(href))

        house_all = position_str1 + house_info_str + totalPrice_str + href_str + AD_str
        print(house_all)

        # Append the row to the same CSV file that received the header
        with open('鏈家二手房100頁.csv', 'a', newline='', encoding='utf-8') as file:
            writer = csv.writer(file, delimiter=',')
            writer.writerow(house_all)

    print(f'--- page {page} ---')
    times = choice([3, 4, 5, 6])
    print(f'sleep {times}\n')
    time.sleep(times)
print('fin')
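One weakness of a long crawl like this is that a single failed request or an unexpected page aborts the whole run. As a rough sketch (not the code used in this post), the loop can be separated from the network call so that a bad page is recorded and skipped. `crawl`, `fetch`, and `handle_page` are hypothetical names; the delay choices mirror the 3 to 6 seconds used above.

```python
import random
import time


def crawl(page_urls, fetch, handle_page, delays=(3, 4, 5, 6), sleep=time.sleep):
    """Fetch each URL in order, pass the HTML to handle_page, pause between pages.

    A page that raises is reported and skipped so the rest of the run continues.
    Returns the list of URLs that failed.
    """
    failed = []
    for number, url in enumerate(page_urls, start=1):
        try:
            handle_page(fetch(url))
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f'page {number} failed: {error}')
            failed.append(url)
        else:
            print(f'--- page {number} done ---')
        # Random pause so requests are not sent back to back.
        sleep(random.choice(delays))
    print('fin')
    return failed
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and `handle_page` would hold the parsing and CSV-writing steps. Passing `fetch` and `sleep` in as arguments also makes it possible to try the loop with fake pages and no waiting.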

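Data Overview

Once the code above has finished running, you will find a CSV file containing the scraped data.

Opening it, you can see that the community name, location, layout, floor area, orientation, floor, year built, building type, asking price, and detail-page URL are all there~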
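As a quick sanity check on the file, one could read it back and derive a price per square meter. This is a small sketch under stated assumptions: the area column is assumed to hold strings such as `100平米`, the price column is assumed to be a number in units of 萬 (10,000 yuan), and the sample rows are invented. `price_per_sqm` is a hypothetical helper name.

```python
import csv
import io
import re


def price_per_sqm(rows):
    """Yield (community, yuan per square meter) for rows whose numbers parse."""
    for row in rows:
        area = re.search(r'\d+(?:\.\d+)?', row['平米數'])
        try:
            total = float(row['錢'])
        except ValueError:
            continue  # price is not a plain number, skip the row
        if area is None or float(area.group()) == 0:
            continue  # no usable area, skip the row
        # Assumes the price column is in units of 萬 (10,000 yuan).
        yield row['小區'], round(total * 10_000 / float(area.group()))


# Invented sample rows using a subset of the column names from this post.
SAMPLE_CSV = (
    '小區,平米數,錢,單位\n'
    'Sample Garden,100平米,500,萬\n'
    'Broken Row,unknown,n/a,萬\n'
)

if __name__ == '__main__':
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    print(list(price_per_sqm(reader)))
```

For the real file, the `io.StringIO` object would be replaced by the opened CSV, using `encoding='utf-8'` and `newline=''` as when it was written.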

How can this data be used for analysis?

See another article on this official account.

