Scraping Ctrip Hotel Data (New)

Preface: Because the Ctrip site keeps changing its pages and keeps escalating its anti-crawler measures, many existing Ctrip scraping scripts can no longer retrieve any data.
Core of this article: obtain Ctrip hotel data by swapping in a fresh cookie value.
It is organized into the following four parts:

  1. headers
  2. data
  3. JSON parsing
  4. Full code

Preface

Environment: Python 3.6 + requests
The script also includes some file-writing operations.
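Judging from the imports in the full script below, the third-party packages needed are requests, numpy, pandas, tqdm and beautifulsoup4; assuming a standard pip setup, they can be installed with pip install requests numpy pandas tqdm beautifulsoup4.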

1. headers

The crawler has to mimic a browser, so the headers are indispensable; all of the values can easily be found in the browser's developer tools for the page.

headers = {
        "Connection": "keep-alive",
        "Cookie":cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",     
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8"
    }

The most important part is the cookies: without them the request fails validation and returns empty data. Note also that the cookie value must come from a logged-in session.
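
Since the whole approach revolves around refreshing the cookie value, one small convenience (my own suggestion, not something the original post does) is to keep the cookie string in a local file so it can be swapped without touching the code. A minimal sketch, assuming a hypothetical file named ctrip_cookies.txt:

from pathlib import Path

def load_cookies(path="ctrip_cookies.txt"):
    # The file holds the raw Cookie header value copied from a logged-in browser session.
    return Path(path).read_text(encoding="utf-8").strip()

cookies = load_cookies()
headers["Cookie"] = cookies  # plug the fresh value into the headers dict shown above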

2. The data payload

Since the data is scraped through Ctrip's AJAX data interface, the key step is assembling the matching data payload; only then does the interface return a valid result. In the browser's network inspector, the fields we need can be read from the request details (the form data sent with the request).

data = {
            "StartTime": "2020-10-09",
            "DepTime": "2019-10-10",
            "RoomGuestCount": "1,1,0",
            "cityId": 575,
            "cityPY": "qamdo",
            "cityCode": "0895",
            "page": page
        }
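
As a quick sanity check (my own suggestion, not part of the original post), the payload can be url-encoded and compared against the Form Data shown in the browser's developer tools:

from urllib.parse import urlencode

data = {
    "StartTime": "2020-10-09",
    "DepTime": "2020-10-10",
    "RoomGuestCount": "1,1,0",
    "cityId": 575,
    "cityPY": "qamdo",
    "cityCode": "0895",
    "page": 1,
}
print(urlencode(data))
# StartTime=2020-10-09&DepTime=2020-10-10&RoomGuestCount=1%2C1%2C0&cityId=575&...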

3. JSON parsing

Once the right data interface has been located, we use the requests library to send a GET or POST request, attaching the headers and data built above, and receive the corresponding JSON.
Individual fields such as the hotel URL, score and address can then be pulled out of that JSON by indexing into it.

 html = requests.post(url, headers=headers, data=data)
 hotel_list = html.json()["hotelPositionJSON"]
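
A slightly more defensive sketch (my own variation, using the same keys as the full code below): check for the empty response you get when the cookie has expired before reading individual fields with .get().

import requests

# url, headers and data are the values built in the previous sections
resp = requests.post(url, headers=headers, data=data)
hotel_list = resp.json().get("hotelPositionJSON") or []
if not hotel_list:
    print("Empty result - the cookie is probably stale; refresh it and retry.")
for item in hotel_list:
    print(item.get("id"), item.get("name"), item.get("score"), item.get("address"))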

4. Full code

# coding=utf8
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import random
import time
import csv
import json
import re
from tqdm import tqdm
# Pandas display option
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width',1000)

url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
filename = "F:\\aaa\\changdu.csv"
print(requests.post(url))  # bare request just to check the endpoint responds; prints the Response object
def Scrap_hotel_lists():
    cookies = '''......'''  # paste the Cookie header value copied from a logged-in session here
    headers = {
        "Connection": "keep-alive",
        "Cookie":cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",     
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8"
    }
    id = []
    name = []
    hotel_url = []
    address = []
    score = []
    star = []
    stardesc = []
    lat = []
    lon = []
    dpcount = []
    dpscore = []
    for page in tqdm(range(1, 13), desc='in progress', ncols=10):
        data = {
            "StartTime": "2020-10-09",
            "DepTime": "2019-10-10",
            "RoomGuestCount": "1,1,0",
            "cityId": 575,
            "cityPY": "qamdo",
            "cityCode": "0895",
            "page": page
        }
        html = requests.post(url, headers=headers, data=data)
        hotel_list = html.json()["hotelPositionJSON"]
        for item in hotel_list:
            print(item)
            id.append(item['id'])
            name.append(item['name'])
            hotel_url.append(item['url'])
            address.append(item['address'])
            score.append(item['score'])
            stardesc.append(item['stardesc'])
            lat.append(item['lat'])
            lon.append(item['lon'])
            dpcount.append(item['dpcount'])
            dpscore.append(item['dpscore'])
            if(item['star']==''):
                star.append('NaN')
            else:
                star.append(item['star'])
        time.sleep(random.randint(3,5))
    hotel_array = np.array((id, name, score, hotel_url, address,star,stardesc,lat,lon,dpcount,dpscore)).T
    list_header = ['id', 'name', 'score', 'url', 'address',
                   'star','stardesc','lat','lon','dpcount','dpscore']
    array_header = np.array((list_header))
    hotellists = np.vstack((array_header, hotel_array))
    with open(filename, 'w', encoding="utf-8-sig", newline="") as f:
        csvwriter = csv.writer(f, dialect='excel')
        csvwriter.writerows(hotellists)
if __name__ == "__main__":
    Scrap_hotel_lists()
    df = pd.read_csv(filename, encoding='utf8')
    print(df)
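
As a side note (my own variation, not the author's method): the numpy-stacking plus csv-writer step could be replaced by building a pandas DataFrame from the lists collected inside Scrap_hotel_lists and letting to_csv handle the file, for example:

    # assumes the id, name, score, ... lists collected above are still in scope
    df_out = pd.DataFrame({
        "id": id, "name": name, "score": score, "url": hotel_url,
        "address": address, "star": star, "stardesc": stardesc,
        "lat": lat, "lon": lon, "dpcount": dpcount, "dpscore": dpscore,
    })
    df_out.to_csv(filename, index=False, encoding="utf-8-sig")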

Note: the Ctrip website is redesigned frequently; this program is for learning purposes only.

Original article: https://blog.csdn.net/weixin_45026680/article/details/108609247
