Scraping census data from the National Bureau of Statistics with selenium

程序员文章站 2022-05-04 18:08:48

This crawler uses the following libraries:
selenium + ChromeDriver
beautifulsoup
requests

Installation instructions are easy to find online, so they are not covered here.

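The crawler consists of two modules.
The first uses selenium to load the page, then uses beautifulsoup to extract the link addresses.
The second uses requests to download the xls files.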
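
To make the extraction step of the first module concrete, here is a minimal sketch of what it produces: a dict mapping each link's text to its href. So that it runs without third-party packages, it uses the standard library's html.parser in place of beautifulsoup, and it collects links from every ul rather than only those under the first th. The names LinkCollector, extract_links and SAMPLE_HTML, and the hrefs in the sample markup, are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} for every <a href=...> nested inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.files = {}      # link text -> href
        self._ul_depth = 0   # how many <ul> elements we are currently inside
        self._href = None    # href of the <a> being read, if any
        self._text = []      # text fragments of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._ul_depth += 1
        elif tag == "a" and self._ul_depth > 0:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "ul" and self._ul_depth > 0:
            self._ul_depth -= 1
        elif tag == "a" and self._href is not None:
            self.files["".join(self._text).strip()] = self._href
            self._href = None


# Hypothetical markup that imitates a table-of-contents page.
SAMPLE_HTML = """
<table><tr><th>
  <ul><li><a href="html/A0101.xls">Table 1-1</a></li>
      <li><a href="html/A0102.xls">Table 1-2</a></li></ul>
</th></tr></table>
<a href="index.htm">Home</a>
"""


def extract_links(html):
    """Return the {link text: href} dict for the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.files
```

Because the dict is keyed by link text, two links with the same text would overwrite each other; that is worth keeping in mind if the page repeats table titles.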

The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import requests_download  # local module, listed after this script
import time

files = {}  # file name -> link

req_url = 'http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/lefte.htm'
chrome_options = Options()
chrome_options.add_argument('--headless')  # headless mode
# `options=` replaces the deprecated `chrome_options=` keyword
browser = webdriver.Chrome(options=chrome_options)

browser.get(req_url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
th = soup.find('th')
uls = th.find_all('ul')
for ul in uls:
    urls = ul.find_all('a')
    for url in urls:
        text = url.get_text()
        link = url.get('href')
        files[text] = link

for f in files:
    requests_download.download(f, files[f])  # download the file
    time.sleep(2)  # pause between requests

browser.quit()  # quit() closes every window and ends the driver process

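# requests_download.py
import os

import requests

def download(file_name, url_file):
    url_file = 'http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/' + url_file
    # stream=True defers fetching the body until it is iterated over
    r = requests.get(url_file, stream=True)
    os.makedirs('downloads', exist_ok=True)  # make sure the target folder exists
    file_name = os.path.join('downloads', file_name + '.xls')
    with open(file_name, 'wb') as f:  # the with block closes the file when done
        for chunk in r.iter_content(chunk_size=512):  # write to disk while downloading
            if chunk:
                f.write(chunk)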
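
The download function builds the URL by string concatenation and uses the link text directly as the file name. A possible variation, sketched here under my own assumptions, resolves the href with urljoin (which also copes with absolute hrefs) and replaces characters that Windows does not allow in file names. The names build_target and BASE_URL and the sample href are hypothetical; only the base address comes from the code above.

```python
from pathlib import Path
from urllib.parse import urljoin

# Base address taken from the download function; the constant name is made up.
BASE_URL = "http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/"

# Characters that Windows rejects in file names.
INVALID_CHARS = '\\/:*?"<>|'


def build_target(file_name, href, out_dir="downloads"):
    """Return (absolute URL, local path) for one entry of the files dict."""
    url = urljoin(BASE_URL, href)
    safe_name = "".join("_" if c in INVALID_CHARS else c for c in file_name).strip()
    return url, Path(out_dir) / (safe_name + ".xls")
```

The returned URL can then be passed to requests.get and the returned path to open, exactly as in the download function.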

Partial results are shown below:
[Image: screenshot of the results]

I'm new to this, so corrections and suggestions are welcome.
