Crawling the files published by a * website, so nobody has to download them by hand
程序员文章站
2022-06-24 09:01:59
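Someone asked about this on 博问 (a Q&A board) and could not get it working, so I wrote it up.
Whatever you do, do not add multithreading to this.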
import re

import requests
from lxml import etree

BASE_URL = 'http://www.liyang.gov.cn/'
url = BASE_URL + 'default.php?mod=article&fid=163250&s99679207_start=0'

# Fetch the listing page and collect each article's link and title.
rp = requests.get(url)
re_html = etree.HTML(rp.text)
url_xpath = '//*[@id="s99679207_content"]/table/tbody/tr/td/span[1]/span/a/@href'
title_xpath = '//*[@id="s99679207_content"]/table/tbody/tr/td/span[1]/span/a/text()'
url_list = re_html.xpath(url_xpath)
title_list = re_html.xpath(title_xpath)

# Visit the articles one by one, pairing each link with its title.
for url_end, title in zip(url_list, title_list):
    new_url = f'{BASE_URL}{url_end}'
    print(new_url)
    rp_1 = requests.get(new_url)
    try:
        # First choice: the attachment link in the first table row.
        # Select @href so we get the link string, not the <a> element.
        re_1_html = etree.HTML(rp_1.text)
        data_url_xpath = '//tbody/tr[1]/td[2]/a/@href'
        data_url = re_1_html.xpath(data_url_xpath)[0]
    except Exception:
        # Fallback: the first link that opens in a new tab.
        data_list = re.findall('<a href="(.*?)" target="_blank">', rp_1.text)
        data_url = data_list[0]
    print(data_url)
    data_url = f'{BASE_URL}{data_url}'
    # Named rp_2 rather than `re`, so the re module is not shadowed
    # and re.findall still works on later iterations.
    rp_2 = requests.get(data_url)
    with open(f'{title}.pdf', 'wb') as fw:
        fw.write(rp_2.content)
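Since the advice here is to stay single-threaded, the download step could also be pulled out into a small sequential helper that pauses between requests. The following is only a rough sketch under my own assumptions: the names `safe_filename`, `download_all`, `fetch_with_requests` and the `delay` parameter are hypothetical and not part of the script above, and the one-second pause is an arbitrary choice. It also cleans the titles before using them as file names, because a title containing `/` or `?` may not be a valid file name.

```python
import re
import time
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = 'http://www.liyang.gov.cn/'


def safe_filename(title, suffix='.pdf'):
    """Replace characters that are commonly not allowed in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    return (cleaned or 'untitled') + suffix


def fetch_with_requests(url):
    """Hypothetical default fetcher: a plain GET that raises on HTTP errors."""
    import requests  # imported here so the helpers above work without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def download_all(items, fetch=fetch_with_requests, out_dir='.', delay=1.0):
    """Save (title, href) pairs one at a time, sleeping between requests."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for title, href in items:
        data = fetch(urljoin(BASE_URL, href))  # resolve relative links
        path = out_dir / safe_filename(title)
        path.write_bytes(data)
        saved.append(path)
        time.sleep(delay)  # strictly sequential, with a pause after each file
    return saved
```

With the lists from the script above, a call might look like `download_all(zip(title_list, data_urls))`, where `data_urls` stands for the attachment links collected from each article page. Passing `fetch` as a parameter is just a convenience so the loop can be tried out with a fake fetcher before pointing it at the real site.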