Parsing data with XPath (scraping the names of cities across China)

程序员文章站, 2022-05-07 23:09:28

Target site: https://www.aqistudy.cn/historydata/

# Created: 2020/12/27 22:00
# IDE: PyCharm
# Author: Friday
# URL: https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        # The standard header name is "User-Agent" (with a hyphen), not "user_agent".
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    # Method 1:
    # # Hot cities
    # hot_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
    # hot_name_list = []
    # for li in hot_city_list:
    #     hot_name = li.xpath('./a/text()')[0]
    #     hot_name_list.append(hot_name)
    # # print(hot_name_list)
    #
    # # All cities, option 1:
    # # all_city_list = []
    # # all_city_ul_list = tree.xpath('//div[@class="bottom"]/ul')
    # # for ul in all_city_ul_list:
    # #     get_li_list = ul.xpath('./div/li')
    # #     for li in get_li_list:
    # #         name = li.xpath('./a/text()')[0]
    # #         hot_name_list.append(name)
    # # All cities, option 2:
    # # all_city_li = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    # # for li in all_city_li:
    # #     name = li.xpath('./a/text()')[0]
    # #     hot_name_list.append(name)
    # print(hot_name_list)
    # print(len(hot_name_list))

    # Method 2: select the a tags of both groups with one XPath union (|)
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names)
    print(len(all_city_names))

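Summary: Looking at the page's markup, the obvious approach is to run two separate XPath queries, one for the li tags under "hot cities" and one for the li tags under "all cities". On closer thought this can be simplified: every city name we want sits inside an a tag, so a single XPath expression joined with the | operator can select the a tags of both groups at once, and the results can then be processed in one loop.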
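
The union idea can be tried offline with a minimal sketch. The markup below is a hypothetical, simplified stand-in for the real page (the real structure may differ), and `SAMPLE`, `UNION_XPATH`, and `extract_city_names` are illustrative names that do not appear in the script above. The sample is parsed with `etree.fromstring` rather than `etree.HTML` only to keep the snippet well-formed and self-contained.

```python
from lxml import etree

# Hypothetical markup that mimics the layout the XPath expressions assume:
# "hot cities" are li elements directly under ul, while "all cities"
# sit inside the second div of each ul.
SAMPLE = """
<root>
  <div class="bottom">
    <ul>
      <li><a>Beijing</a></li>
      <li><a>Shanghai</a></li>
    </ul>
  </div>
  <div class="bottom">
    <ul>
      <div><b>A</b></div>
      <div>
        <li><a>Anshan</a></li>
        <li><a>Beijing</a></li>
      </div>
    </ul>
  </div>
</root>
"""

# The same union expression as in Method 2: two location paths joined by |.
UNION_XPATH = (
    '//div[@class="bottom"]/ul/li/a'
    ' | //div[@class="bottom"]/ul/div[2]/li/a'
)


def extract_city_names(tree):
    """Return the text of every a element matched by either path."""
    names = []
    for a in tree.xpath(UNION_XPATH):
        text = a.xpath('./text()')
        if text:  # skip links without text instead of raising IndexError
            names.append(text[0].strip())
    return names


tree = etree.fromstring(SAMPLE)
names = extract_city_names(tree)

# A hot city can also appear in the full list, so the raw result may
# contain repeated names; dict.fromkeys removes them and keeps the order.
unique_names = list(dict.fromkeys(names))

print(names)
print(unique_names)
```

The union matches distinct elements, so a city listed in both groups shows up twice in `names`. Whether to deduplicate depends on what the count printed by the script is meant to represent.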