Python + selenium: Scraping Word Document Text from Baidu Wenku

程序员文章站 2022-04-14 18:43:34

# -*- coding:utf-8 -*-

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Run Chrome headless and present a desktop browser user agent
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36")

driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()

url = input("Enter the document URL: ")
driver.get(url)

error_str = ""

try:
    # The page count is read from the page as text with a leading slash
    page_num = driver.find_element(By.XPATH, "//span[@class='page-count']").text

    # Scroll to the banner, then click the button that reveals the rest of the document
    find_button = driver.find_element(By.XPATH, "//div[@class='doc-banner-text']")
    driver.execute_script("arguments[0].scrollIntoView();", find_button)
    button = driver.find_element(By.XPATH, "//span[@class='moreBtn goBtn']")
    button.click()

    for i in range(1, int(page_num.strip('/')) + 1):
        # Scroll each page into view so its text layer is loaded before reading it
        page = driver.find_element(By.XPATH, "//div[@data-page-no='{}']".format(i))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        time.sleep(0.3)
        print(driver.find_elements(By.XPATH, "//div[@data-page-no='{}']//div[@class='reader-txt-layer']".format(i))[-1].text)

except NoSuchElementException:
    # An expected element is missing; check whether the free preview limit was reached.
    # The comparison string is the notice shown on the page, so it stays in Chinese.
    notices = driver.find_elements(By.XPATH, "//div[@class='doc-bottom-text']")
    if notices and notices[0].text == "试读已结束，如需继续阅读或下载":
        error_str = "\n------------------------------------------------------------------\n\n" \
                    "-- Baidu Wenku says the free preview has ended; the full text cannot be scraped. Try again later. --\n\n" \
                    "------------------------------------------------------------------"

finally:
    if error_str:
        print(error_str)
    driver.quit()
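
The script above turns the text of the page-count element into an integer by stripping a slash from it. As a minimal sketch, assuming that text looks something like "/12", the parse could be made a little more tolerant of surrounding whitespace or extra characters. The helper name `parse_page_count` is hypothetical and does not appear in the original script.

```python
import re


def parse_page_count(raw: str) -> int:
    """Return the first integer found in text such as '/12'.

    Hypothetical helper: the '/12' format is an assumption based on
    how the script above strips a slash before calling int().
    """
    match = re.search(r"\d+", raw)
    if match is None:
        raise ValueError("no page count found in {!r}".format(raw))
    return int(match.group())


if __name__ == "__main__":
    print(parse_page_count("/12"))
```

In the script, `int(page_num.strip('/'))` could then be replaced by `parse_page_count(page_num)`, which raises a descriptive `ValueError` when the element text contains no digits.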
