Scraping Second-Hand Home Prices and Listing Information

December 24, 2019
Notes

This article is a reader submission. Author: 董汇标MINUS. Zhihu: https://zhuanlan.zhihu.com/p/97235643

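One day the question of buying a home came up in a group chat with my friends. None of us has been working for more than a few years, so our wallets are not heavy enough yet.

That made me want to take the data from a certain real-estate listing site and see what housing prices actually look like.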

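Preparing to scrape

The site lists new homes, second-hand homes, rentals, and more. For a purchase, especially a first home in Beijing, a second-hand home is likely to be one of the options, so I focused on second-hand listings.

There are plenty of data sets and tutorials online, but I decided to scrape everything again myself: first, to make sure the data is current, and second, to stay in practice and not get lazy.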

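Skills required: BeautifulSoup to parse the pages, regular expressions to extract the fields, and csv to store the data.

Approach: the usual workflow for scraping a website.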

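The image above shows one listing from the site's second-hand home page. From it I need to extract: location, layout (bedrooms and living rooms), floor area in square meters, orientation, decoration, floor, year built, building type, and asking price.

Then inspect the HTML to find the elements that hold each of these fields (scraping tutorials cover this at length, so I will not repeat it here).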

from bs4 import BeautifulSoup
import re
import csv
import requests
import pandas as pd
from random import choice
import time

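Lianjia's second-hand listings span 100 pages in total, so the first step is simply to create a CSV file, give it a name, and set up its columns.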

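columns = ['小区', '地区', '厅', '平米数', '方向', '状态', '层', 'build-year', '形式', '钱', '单位', '网址', '推荐语']
# Column meanings, in order: community, district, layout, floor area, orientation,
# decoration, floor, year built, building type, price, price unit, URL, tagline.

# Skip this step if the file already has a header row.
with open('链家二手房100页.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(columns)  # the with block closes the file, so no close() is needed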
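The comment about skipping the header when the file already has one can be turned into a small check. The following is a minimal sketch, not the author's code: `ensure_header` is a hypothetical helper name, and it assumes that a non-empty file already starts with the header row.

```python
import csv
import os

COLUMNS = ['小区', '地区', '厅', '平米数', '方向', '状态', '层',
           'build-year', '形式', '钱', '单位', '网址', '推荐语']


def ensure_header(path, columns):
    """Write the header row only if the file is missing or empty.

    Hypothetical helper: returns True if a header was written,
    False if the existing file was left untouched.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(columns)
    return True
```

Calling `ensure_header('链家二手房100页.csv', COLUMNS)` before the scraping loop would then make the script safe to re-run without overwriting rows collected earlier.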

Scraping the data

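1. There are 100 pages, so loop over them to fetch the data.

2. Parse each page with BeautifulSoup.

3. Use regular expressions to pull out each field.

4. Write the extracted information to the CSV file created earlier.

5. Print a message at the end of each page to track progress.

6. To avoid "putting pressure on Lianjia's servers", wait a few seconds before moving on to the next page.

7. When everything has been scraped, print fin~ (a personal habit of mine).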
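The loop structure behind steps 1, 5, 6, and 7 can be sketched separately from the network and parsing code. This is only an illustration: `crawl_pages`, `build_url`, and `fetch` are hypothetical names, and the `sleep` parameter is injectable so the loop can be tried without actually waiting.

```python
import random
import time


def crawl_pages(build_url, fetch, pages, delays=(3, 4, 5, 6), sleep=time.sleep):
    """Fetch pages 1..pages in order, pausing a random interval after each.

    build_url: maps a page number to a URL (hypothetical helper).
    fetch:     maps a URL to page content, e.g. a wrapper around requests.get.
    sleep:     injectable so the loop can be exercised without real waiting.
    """
    results = []
    for page in range(1, pages + 1):
        url = build_url(page)
        results.append(fetch(url))
        print(f'--- page {page} done ---')   # step 5: progress message
        sleep(random.choice(delays))         # step 6: polite random delay
    print('fin')                             # step 7
    return results
```

Iterating with `range(1, pages + 1)` keeps the page number in the URL and in the progress message in sync, which a separately incremented counter makes easy to get wrong.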

# base_url1 and base_url2 are the parts of the listing URL before and after the
# page number. They are not shown in the original post and must be defined first.
for page in range(1, 101):
    urll = base_url1 + str(page) + base_url2
    print(urll)
    get_page = requests.get(urll)
    bs_page = BeautifulSoup(get_page.text, 'html.parser')
    list_house = bs_page.find_all('div', class_='info clear')
    for house_one in list_house:
        house_info = house_one.find_all('div', class_='houseInfo')
        position_info = house_one.find_all('div', class_='positionInfo')
        totalPrice = house_one.find_all('div', class_='totalPrice')
        href = house_one.find_all('div', class_='title')

        # Regex extraction
        # Community name and district
        position_str = re.findall(r'_blank">(.+)</a.+_blank">(.+)?</a', str(position_info))
        position_str1 = list(position_str[0])

        # Listing details, separated by '|'
        house_info_str = re.findall(r'span>(.+)?</div>', str(house_info))
        house_info_str = str(house_info_str)[2:-2].split('|')

        # Total price and its unit
        totalPrice_str = re.findall(r'<span>(.+)</span>(.+)</div>', str(totalPrice))
        totalPrice_str = list(totalPrice_str[0])

        # Detail-page URL
        href_str = re.findall(r'http.+html', str(href))

        # Listing tagline
        AD_str = re.findall(r'_blank">(.+)?</a>', str(href))

        house_all = position_str1 + house_info_str + totalPrice_str + href_str + AD_str
        print(house_all)

        # Append to the CSV created earlier (same file name as the header step)
        with open('链家二手房100页.csv', 'a', newline='', encoding='utf-8') as file:
            writer = csv.writer(file, delimiter=',')
            writer.writerow(house_all)

    print(f'--- page {page} ---')
    times = choice([3, 4, 5, 6])
    print(f'sleep {times}\n')
    time.sleep(times)
print('fin')
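To see what the regular expressions do without hitting the site, they can be run against a small fragment. The sketch below uses invented HTML that only imitates the shape the patterns expect; the real page markup may differ, and `parse_house_info` and `parse_total_price` are hypothetical helper names. Unlike the loop above, it returns an empty result instead of raising when nothing matches.

```python
import re

# Invented fragments for illustration only; the real page markup may differ.
HOUSE_INFO_HTML = ('<div class="houseInfo"><span class="houseIcon"></span>'
                   '2室1厅 | 89.5平米 | 南 北 | 精装 | 中楼层(共18层) | 2008年建 | 板楼</div>')
TOTAL_PRICE_HTML = '<div class="totalPrice"><span>520</span>万</div>'


def parse_house_info(html):
    """Return the '|'-separated fields of a houseInfo block, stripped."""
    match = re.search(r'span>(.+)</div>', html)
    if match is None:
        return []
    return [field.strip() for field in match.group(1).split('|')]


def parse_total_price(html):
    """Return (amount, unit) from a totalPrice block, or None if absent."""
    match = re.search(r'<span>(.+)</span>(.+)</div>', html)
    return match.groups() if match else None
```

With the sample fragment, `parse_house_info` yields seven fields, which line up with the layout, floor area, orientation, decoration, floor, year built, and building type columns of the CSV header.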

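Data overview

Once the code above has finished running, you get a table. Opening it shows data like this:

As you can see, it now has the community name, district, layout, floor area, orientation, floor, year built, building type, asking price, and the URL of each detail page.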
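Because each row is assembled from several regex matches, a listing that lacks a field may produce a row that is shorter than the header. A quick check along these lines may be worth running before any analysis. It is a sketch only, and `find_ragged_rows` is a hypothetical helper name.

```python
import csv


def find_ragged_rows(lines):
    """Return (line_number, row) pairs whose length differs from the header's.

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return []
    return [(reader.line_num, row) for row in reader if len(row) != len(header)]
```

For the scraped file this could be called as `find_ragged_rows(open('链家二手房100页.csv', newline='', encoding='utf-8'))`, and any rows it reports can be inspected or dropped before analysis.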

How can this data be used for analysis?

See another article on this official account.

This article is a reader submission by 董汇标MINUS and was first published on Zhihu; the original link is given at the top.
