Parsing data with XPath (scraping the names of cities across China)
程序员文章站
2022-05-07 23:09:28
# Created: 2020/12/27 22:00
# IDE: PyCharm
# Author: Friday
# URL: https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        'Referer': 'http://pic.netbian.com/4kmeinv/index_2.html',
        # The HTTP header name is "User-Agent"; a "user_agent" key would not be treated as the User-Agent header.
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    # Fetch the page and parse the HTML into an element tree
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    # Method 1: parse "Hot cities" and "All cities" separately (kept for reference, commented out)
    # # Hot cities
    # host_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
    # host_name_list = []
    # for li in host_city_list:
    #     host_name = li.xpath('./a/text()')[0]
    #     host_name_list.append(host_name)
    # # print(host_name_list)
    #
    # # All cities, option 1: walk every ul, then every li inside its div children
    # # all_city_list = []
    # # all_city_ul_list = tree.xpath('//div[@class="bottom"]/ul')
    # # for ul in all_city_ul_list:
    # #     get_li_list = ul.xpath('./div/li')
    # #     for li in get_li_list:
    # #         name = li.xpath('./a/text()')[0]
    # #         host_name_list.append(name)
    # # All cities, option 2: select the li tags directly through div[2]
    # # all_city_li = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    # # for li in all_city_li:
    # #     name = li.xpath('./a/text()')[0]
    # #     host_name_list.append(name)
    # print(host_name_list)
    # print(len(host_name_list))
    # Method 2: select the a tags of both sections with a single XPath union (|)
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        # Each a tag holds one city name as its text node
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names)
    print(len(all_city_names))
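One thing to watch: if a hot city is also listed under "All cities" (likely, though not verified against the live page here), the combined list will contain the same name more than once, and the printed length will overstate the number of cities. A minimal sketch of an order-preserving de-duplication step follows; the helper name `unique_in_order` and the sample names are hypothetical and not part of the original script.

```python
def unique_in_order(names):
    """Drop repeated names, keeping the first occurrence of each.

    Hypothetical helper for illustration; not part of the original script.
    """
    # dict keys keep insertion order (Python 3.7+), so fromkeys works as an ordered set.
    # strip() guards against stray whitespace around the text node.
    return list(dict.fromkeys(name.strip() for name in names))


if __name__ == "__main__":
    # Made-up sample standing in for all_city_names
    sample = ["Beijing", "Shanghai", "Aba", "Beijing", " Shanghai "]
    print(unique_in_order(sample))
```

In the script above this would be applied as `unique_in_order(all_city_names)` before printing.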
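Summary: Looking at the page's HTML structure, the obvious approach is to run two XPath queries, one for the li tags under "Hot cities" and one for the li tags under "All cities". On reflection this can be optimized: every city name we want sits inside an a tag, so a single XPath expression can select the a tags of both sections at once, and the results can then be processed in one loop.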
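To see the union operator without hitting the network, here is a small sketch that runs the same two paths against an inline HTML snippet. The snippet is a simplified, hypothetical stand-in for the page structure (the real markup at aqistudy.cn may differ), and `extract_names` is a name made up for this example.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the page: the first block holds the
# hot cities (ul/li/a), the second holds the full list (ul/div[2]/li/a).
SAMPLE_HTML = """
<div class="bottom">
  <ul>
    <li><a href="#">Beijing</a></li>
    <li><a href="#">Shanghai</a></li>
  </ul>
</div>
<div class="bottom">
  <ul>
    <div>A.</div>
    <div>
      <li><a href="#">Anshan</a></li>
      <li><a href="#">Anyang</a></li>
    </div>
  </ul>
</div>
"""

HOT_PATH = '//div[@class="bottom"]/ul/li/a/text()'
ALL_PATH = '//div[@class="bottom"]/ul/div[2]/li/a/text()'


def extract_names(html):
    """Return (hot, rest, union) name lists for the given HTML string."""
    tree = etree.HTML(html)
    hot = tree.xpath(HOT_PATH)    # first query: hot cities only
    rest = tree.xpath(ALL_PATH)   # second query: full list only
    # One query with "|" returns the nodes matched by either path,
    # in document order.
    union = tree.xpath(HOT_PATH + ' | ' + ALL_PATH)
    return hot, rest, union


if __name__ == "__main__":
    hot, rest, union = extract_names(SAMPLE_HTML)
    print(hot, rest, union)
```

Appending `/text()` to each path also removes the need for the inner `a.xpath('./text()')[0]` call, since the query then returns the name strings directly.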