Python Web Scraping Study - Day 7

程序员文章站 2022-05-04 17:55:12

Use selenium to simulate a login, then scrape the post information

The code is as follows:

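import json
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By


browser = webdriver.Chrome()
url = 'http://www.dxy.cn/bbs/index.html'
browser.get(url)
time.sleep(3)
browser.maximize_window()  # maximize the browser window
time.sleep(5)
# browser.switch_to.frame(0)  # switch to the iframe that holds the email/account login box
# find_element(By.XPATH, ...) is used because the older find_element_by_xpath
# helpers were removed in newer Selenium 4 releases
browser.find_element(By.XPATH, '//*[@id="headerwarp"]/div/div[1]/div/a[1]').click()  # click "log in"
browser.find_element(By.XPATH, '/html/body/div[2]/div[2]/div[1]/a[2]/i').click()  # switch to the computer login form
browser.find_element(By.XPATH, '//*[@id="username"]').send_keys('***********')  # enter the account name
password = browser.find_element(By.XPATH, '//*[@id="user"]/div[1]/div[1]/div[1]/div[2]/input')  # locate the password field
password.send_keys('**********')  # enter your own password
login_em = browser.find_element(By.XPATH, '//*[@id="user"]/div[1]/div[3]/button')  # locate the login button
login_em.click()  # click the login button
# A captcha appears here and is not solved automatically yet;
# complete it by hand during this pause, then the scraping continues.
time.sleep(30)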

browser.get('http://www.dxy.cn/bbs/topic/509959?keywords=%E6%99%95%E5%8E%A5%E5%BE%85%E6%9F%A5%E2%80%94%E2%80%94%E8%AF%B7%E6%95%99%E5%90%84%E4%BD%8D%E5%90%8C%E4%BB%81+-+%E5%BF%83%E8%A1%80%E7%AE%A1%E4%B8%93%E4%B8%9A%E8%AE%A8%E8%AE%BA%E7%89%88+-%E4%B8%81%E9%A6%99%E5%9B%AD%E8%AE%BA%E5%9D%9B%E2%80%8B+')

html = browser.page_source

selector = etree.HTML(html)

# user name of each post's author
use = selector.xpath("""//*/table/tbody/tr/td[1]/div[2]/a/text()""")

# print the container count first, so an empty result is visible
# before the [0] index below can raise IndexError
containers = selector.xpath('//*[@id="postcontainer"]')
print(len(containers))
s = containers[0].xpath('div//td[@class="postbody"]')  # body cell of each post

L = []
# open the output file once rather than once per post;
# despite the .csv name, each line written is one JSON object
with open("丁香园信息.csv", 'a', encoding="utf-8") as f:
    for uses, ss in zip(use, s):
        a = "User: " + uses
        b = ": " + ''.join(ss.xpath('text()')).strip()
        # b = "Reply content: " + ss.strip()
        dic = {a: b}
        L.append(dic)
        f.write(json.dumps(dic, ensure_ascii=False) + '\n')
print(L)  # print the records to check them for errors
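
Since the script writes one JSON object per line, the output file is effectively in JSON Lines form rather than CSV. As a minimal sketch of how such a file could be written and read back for checking, the block below uses two hypothetical helpers, `save_records` and `load_records`, and made-up sample records; none of these names or values come from the original script.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Append each dict in `records` to `path` as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Made-up sample data standing in for the scraped posts.
    sample = [{"User: alice": ": first reply"}, {"User: bob": ": 第二条回复"}]
    path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")
    save_records(sample, path)
    print(load_records(path) == sample)
```

Reading the file back this way is a quick check that every line is valid JSON, which a plain CSV reader would not give you.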

Open issue: the captcha is still not handled automatically.
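
Until the captcha itself is handled, the fixed `time.sleep(30)` can either waste time or run out too early. One possible stopgap, sketched here under assumptions, is to poll for a sign that the manual login has finished. `wait_until` is a hypothetical helper that is not part of the original script, and it does not solve the captcha; it only waits for a person to do so.

```python
import time


def wait_until(condition, timeout=120.0, interval=1.0):
    """Call `condition()` every `interval` seconds until it returns a truthy value.

    Returns True as soon as the condition holds, or False once `timeout`
    seconds have elapsed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the script, a call such as `wait_until(lambda: 'login' not in browser.current_url)` could stand in for the sleep. The substring `'login'` is only a guess at what the login page URL contains and would need checking against the real site.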

