Python crawler: simulating browser actions with selenium + PhantomJS, parsing pages with BeautifulSoup, and downloading files with requests
PhantomJS installation (see http://www.cnblogs.com/yestreenstars/p/5511212.html)
# install the dependencies
yum -y install wget fontconfig
# download PhantomJS
wget -P /tmp/ https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-i686.tar.bz2
# extract
tar xjf /tmp/phantomjs-2.1.1-linux-i686.tar.bz2 -C /usr/local/
# rename
mv /usr/local/phantomjs-2.1.1-linux-i686 /usr/local/phantomjs
# create a symbolic link
ln -s /usr/local/phantomjs/bin/phantomjs /usr/bin/
Test:
Run the phantomjs command directly; if it starts up and drops into the interactive prompt, the installation succeeded. (The original post showed a screenshot of the prompt here.)
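On a headless server it can be easier to check the version instead; if the binary and the soft link are in place, this prints the version number:

# sanity check: should print 2.1.1 for the build installed above
phantomjs --version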
Install selenium by following the official documentation; in some cases a browser driver also has to be installed.
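On the Python side, assuming a pip-based environment, the packages used below can be installed like this (note that webdriver.PhantomJS was removed in Selenium 4, so an older selenium release is assumed here):

# Python dependencies for the scripts below; selenium is pinned below 4
pip install "selenium<4" beautifulsoup4 requests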
This time the target page is asynchronous: part of its content is generated by JavaScript, and the JS also handles pagination, so the content of every page has to be crawled.
selenium + PhantomJS is used to simulate the browser; PhantomJS is a browser without a window (headless).
Note that the webdriver has to be shut down once the work is finished, otherwise the phantomjs process stays alive; the method for this is quit(), as sketched below and shown in detail in the full code.
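A minimal sketch of that open-work-quit pattern (the URL here is a placeholder, not the real target):

from selenium import webdriver

# always pair the driver with quit(), e.g. via try/finally
browser = webdriver.PhantomJS(executable_path="/usr/local/phantomjs/bin/phantomjs")
try:
    browser.get("http://example.com")   # placeholder URL
    html = browser.page_source          # rendered HTML, including JS-generated parts
finally:
    browser.quit()                      # otherwise the phantomjs process keeps running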
The full code is as follows:
# -*- coding: UTF-8 -*-
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver


def get_source(page_num):
    url_list = []
    # load the PhantomJS browser driver
    browser = webdriver.PhantomJS(executable_path="/usr/local/phantomjs/bin/phantomjs")
    # maximize the window (viewport)
    browser.maximize_window()
    # open the target URL
    browser.get('------------url------------')
    # find the "annual report" tab ('年报') and click it
    browser.find_element_by_xpath("//a[contains(text(),'年报') and @data-toggle='tab']").click()
    # wait 5 seconds for the JS to render
    time.sleep(5)
    # hand the rendered source to BeautifulSoup for parsing
    soup = BeautifulSoup(browser.page_source, "html.parser")
    url_list.extend(parser(soup))
    # XPath prefix/suffix of the paging links (attribute values are placeholders)
    pre = "//a[@class='classStr' and @id='idStr' and @page='"
    suf = "' and @name='nameStr' and @href='javascript:;']"
    for i in range(2, page_num):
        # click page i and wait for it to load
        browser.find_element_by_xpath(pre + str(i) + suf).click()
        time.sleep(10)
        soup = BeautifulSoup(browser.page_source, "html.parser")
        url_list.extend(parser(soup))
    # must quit, or the phantomjs process will stay alive
    browser.quit()
    return url_list


def parser(soup):
    pdf_path_list = []
    domain = "------url---------"
    # find every tag whose 'href' attribute contains 'pdf'
    for tag in soup.find_all(href=re.compile("pdf")):
        hre = tag.get('href')
        if hre.find("main_cn") < 0:
            pdf_path = domain + hre
            pdf_path_list.append(pdf_path)
            print(pdf_path)
    return pdf_path_list


def main():
    url_list = get_source(96)
    # write one URL per line so the download script can read the file line by line
    with open('pdf_url.txt', 'w') as f:
        for url in url_list:
            f.write(url + '\n')


main()
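A note on the parameters: main() calls get_source(96); the first page is parsed before the loop and range(2, page_num) then clicks through pages 2 to 95, so pass the real page count (plus one if the last page should also be clicked). The classStr, idStr and nameStr values in the paging XPath are placeholders and have to be replaced with the real attribute values of the paging links on the target site.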
For element locating in selenium, see http://www.cnblogs.com/yufeihlf/p/5717291.html#test2
and http://www.cnblogs.com/yufeihlf/p/5764807.html
For parsing pages with BeautifulSoup, see http://cuiqingcai.com/1319.html
The code above collects the URLs of the PDFs to download; requests is then used to download them. The code is as follows:
import time
import requests


def download_file(url):
    # NOTE the stream=True parameter: the PDF is fetched in chunks instead of being loaded into memory at once
    local_filename = "./pdfs/" + url[url.rfind('/') + 1:]
    print(local_filename)
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    f.flush()
        print(url + " has been downloaded")
    else:
        print("r.status_code = " + str(r.status_code) + ", download failed")


# read the URL list written by the crawler (one URL per line) and download with a pause between files
f = open("pdf_url.txt", "r")
for line in f.readlines():
    download_file(line.strip())
    time.sleep(10)
f.close()
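The download script assumes the ./pdfs/ directory already exists and re-downloads every file on each run. A small guard along these lines (a sketch; the helper name is my own) creates the directory and skips files that are already on disk; download_file() would call it first and return early when it gets None:

import os

def local_path_or_none(url, target_dir="./pdfs"):
    # create the target directory on the first run
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    local_filename = os.path.join(target_dir, url[url.rfind('/') + 1:])
    # skip files that were already downloaded in a previous run
    if os.path.exists(local_filename):
        return None
    return local_filename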