Beijing Housing Price Prediction: Data Collection
程序员文章站
2022-07-14 14:26:04
Lanfw (蓝房网) crawler: bs4 + requests, Beijing
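1. Examine the URL structure
As the figure shows, the listing-page URLs end with the suffix search-y{}, where {} is filled with the page number. The fields collected for each development are its name, address, opening date, price, and sale status.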
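The URL pattern can be sketched in a few lines. The template string is the one used by the crawler later in this post; the helper name `build_page_urls` is hypothetical and only serves the illustration.

```python
# URL template taken from the crawler in this post; {} is the page number.
URL_TEMPLATE = 'http://house.lanfw.com/bj/search-y{}'


def build_page_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page (inclusive).

    Hypothetical helper, not part of the original crawler.
    """
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

Generating the URLs up front keeps the paging logic separate from the parsing logic, which makes it easy to test a crawl on a handful of pages first.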
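2. Common uses of soup.select()
1. class
Elements in an HTML document can be located by class. The general form is:
soup.select('.class')
This returns every element that has that class.
2. id
An id is unique within an HTML document, so it can be used to locate a single element:
soup.select('#id')
3. Tag
A tag can be selected directly by its name:
soup.select('a')
4. Combined lookup
To find a tag nested inside an element of a given class, separate the two selectors with a space:
soup.select('.class a')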
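The four selector forms above can be tried on a small document. The HTML below is made up for illustration and is not the real page markup; the sketch also assumes Python's built-in 'html.parser' rather than lxml, so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; not the markup of the real site.
html = """
<div id="header"><a href="/home">Home</a></div>
<div class="lpList">
  <div class="title"><h2><a href="/house/1">House A</a></h2></div>
</div>
<div class="lpList">
  <div class="title"><h2><a href="/house/2">House B</a></h2></div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

by_class = soup.select('.lpList')    # every element with class "lpList"
by_id = soup.select('#header')       # the element with id "header"
by_tag = soup.select('a')            # every <a> tag in the document
combined = soup.select('.lpList a')  # <a> tags nested inside class "lpList"

print(len(by_class), len(by_id), len(by_tag), len(combined))
print([link.text for link in combined])
```

Note that select() always returns a list, even for an id selector, which is why the crawler code indexes results with [0] when it expects a single element.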
3. Crawler code
import re

import requests
from bs4 import BeautifulSoup

# Maps the number in the status image's file name to a sale status.
SALE_STATUS = {
    1: "在售",   # on sale
    3: "尾盘",   # final units
    5: "未售",   # not yet on sale
    15: "售罄",  # sold out
}


def numberToString(number):
    return SALE_STATUS.get(number, '未知')  # '未知' means unknown


def stripLabel(text, *labels):
    # str.strip(chars) removes a set of characters rather than a literal
    # prefix or suffix, so the labels are removed explicitly instead.
    for label in labels:
        text = text.replace(label, '')
    return text.strip(' ::\n\t\r')


def getHousesDetails(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    houses = soup.select('.lpList')
    housesDetails = []
    for house in houses:
        # Development name
        titleLink = house.select('.title h2 a')[0]
        houseName = titleLink.text
        # Address
        infoLines = house.select('.lpTxt div')[1].select('p')
        address = stripLabel(infoLines[1].text, '楼盘地址', '查看地图')
        if len(address) >= 16:
            # Long addresses are re-read from the detail page
            # (presumably because the listing page shortens them).
            detailResponse = requests.get(titleLink['href'])
            detailResponse.encoding = 'utf-8'
            detailSoup = BeautifulSoup(detailResponse.text, 'lxml')
            address = stripLabel(detailSoup.select('.toplpMsg ul li div i')[0].text, '楼盘地址')
        # Opening date
        openTime = stripLabel(infoLines[3].text, '开盘时间')
        # Price
        price = house.select('.price p b')[0].text
        # Sale status, encoded in the image file name, e.g. state_1.jpg
        saleStatusImg = house.select('.title p img')[0]['src']
        match = re.search(r'state_(\d+)\.jpg', saleStatusImg)
        saleStatus = numberToString(int(match.group(1))) if match else '未知'
        # Collect the fields of this development in one dictionary
        housesDetails.append({
            'houseName': houseName,
            'address': address,
            'openTime': openTime,
            'price': price,
            'saleStatus': saleStatus,
        })
    return housesDetails
import pandas


def getAllHousesDetails():
    maxPageNumber = 208  # hard-coded number of listing pages to crawl
    urlBefore = 'http://house.lanfw.com/bj/search-y{}'
    allHousesDetails = []
    for i in range(1, maxPageNumber + 1):
        url = urlBefore.format(i)
        allHousesDetails.extend(getHousesDetails(url))
    return pandas.DataFrame(allHousesDetails)
if __name__ == '__main__':
    allHousesDetails = getAllHousesDetails()
    allHousesDetails.to_excel('houseDetails2.xlsx')
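A note on cleaning the scraped text: str.strip(chars), lstrip and rstrip remove any of the given characters from the ends of a string, not a literal prefix or suffix, so passing a label to them can also eat characters that belong to the value. The sketch below illustrates this with a made-up address string. The helper names `remove_labels` and `parse_status_id` are hypothetical, and the image path format is assumed from the crawler code rather than checked against the live site.

```python
import re


def remove_labels(text, *labels):
    """Remove literal label substrings, then trim whitespace and colons.

    Hypothetical helper for illustration.
    """
    for label in labels:
        text = text.replace(label, "")
    return text.strip(" ::\n\t")


def parse_status_id(src):
    """Extract the number from a path such as /public/images/state_15.jpg.

    Returns None when the path does not match the assumed pattern.
    """
    match = re.search(r"state_(\d+)\.jpg$", src)
    return int(match.group(1)) if match else None


# Made-up sample: the address ends with a character that also occurs in the label.
raw = "楼盘地址:通州区某某基地 查看地图"

naive = raw.strip("楼盘地址: 查看地图")               # character-set stripping
safe = remove_labels(raw, "楼盘地址", "查看地图")     # literal label removal

print(naive)  # the final character of the address is lost
print(safe)   # the address is intact
print(parse_status_id("/public/images/state_15.jpg"))
```

Matching the status number with a regular expression also avoids a ValueError when an image path does not follow the expected pattern.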
4. Crawl results
The crawl returned 2,073 Beijing property developments.
Reference: https://www.jianshu.com/p/72fd7898ea8a