A Beginner's Web Scraper for Downloading Manga
程序员文章站
2022-04-25 23:09:38
Tools used:
1. The IDE is PyCharm.
The script uses the following libraries:
from selenium import webdriver
from lxml import etree
import requests
import time
2. Selenium + Headless Firefox setup:
Download geckodriver, unzip it to get a file called geckodriver.exe, and put that file in the same folder as python.exe.
Here is the code:
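To confirm that the driver can actually be found before starting Selenium, you can look it up on PATH. This is a minimal sketch using only the standard library. The helper name locate_driver is made up for illustration, and it assumes the folder containing python.exe is on PATH, which is what makes the placement above work.

```python
import shutil


def locate_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print("geckodriver found at", path)
```

If the lookup reports nothing, check that geckodriver.exe really sits next to python.exe, or add its folder to PATH.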
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import etree
import requests
import time

# Headers sent with every image request
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}

options = webdriver.FirefoxOptions()
options.add_argument('-headless')
options.add_argument('--disable-gpu')
# Note: options is not passed to Firefox() here, so a visible browser window opens.
# Use webdriver.Firefox(options=options) to actually run headless.
driver = webdriver.Firefox()
# Wait up to 10 seconds when looking up elements (the value is in seconds)
driver.implicitly_wait(10)

# Open the catalog page and collect the chapter links
driver.get('https://www.36mh.com/manhua/henchunhenaimei/')
etree_html = etree.HTML(driver.page_source)
chapter_list = etree_html.xpath('//*[@id="chapter-list-4"]/li/a')

for y, chapter in enumerate(chapter_list, start=1):  # y = chapter number
    url = "https://www.36mh.com" + chapter.get('href')
    driver.get(url)
    if y == 1:
        # Switch to scroll-down reading once; the site keeps this mode for later chapters
        driver.find_element(By.CSS_SELECTOR, '#chapter-scroll').click()
    time.sleep(1)  # give the page a moment to load its images
    etree_chapter_html = etree.HTML(driver.page_source)
    imgs = etree_chapter_html.xpath('//*[@id="images"]/img')
    if len(imgs) < 5:
        # Suspiciously few images: print the count for inspection
        print(len(imgs))
    for x, img in enumerate(imgs, start=1):  # x = page number within the chapter
        img_url = img.get('src')
        Img = requests.get(img_url, headers=Picreferer)
        # Files are named <chapter>-<page>.jpg
        with open("C://Users/revolutionary/Desktop/test/" + str(y) + "-" + str(x) + ".jpg", "wb") as f:
            f.write(Img.content)

driver.quit()
A quick explanation of the code:
When the script runs successfully, it opens a browser window automatically.
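The window appears because the script builds a FirefoxOptions object but calls webdriver.Firefox() without it. If you would rather run without a window, a sketch along these lines should work. The function name make_firefox is hypothetical, and it assumes a Selenium version whose Firefox constructor accepts the options keyword.

```python
def make_firefox(headless=True):
    """Create a Firefox driver; pass headless=False to see the browser window."""
    # Imported inside the function so this file loads even without Selenium installed
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument('-headless')
    # The options only take effect when they are handed to the constructor
    return webdriver.Firefox(options=options)
```

Keeping the window visible while you debug the selectors, then switching to headless, is a reasonable workflow.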
driver.get('https://www.36mh.com/manhua/henchunhenaimei/')
This line opens the manga's catalog page so the chapter links can be scraped; the resulting list holds the link to each chapter.
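The script turns each scraped href into a full address by prepending the site's domain. A slightly more forgiving way to do that is urljoin from the standard library, which also copes with relative or already absolute links. This is only a sketch: the function name chapter_urls and the sample href values are invented for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

CATALOG_URL = "https://www.36mh.com/manhua/henchunhenaimei/"


def chapter_urls(hrefs, base=CATALOG_URL):
    """Turn the href values scraped from the catalog into absolute URLs."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical href values, for illustration only
sample = ["/manhua/henchunhenaimei/1.html", "2.html"]
for url in chapter_urls(sample):
    print(url)
```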
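PS: The site loads its images asynchronously. Clicking "下拉阅读" (scroll-down reading) once solves this. One click is enough, because the site stays in scroll-down mode when you open other chapters.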
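Because the images arrive asynchronously, a fixed one-second pause may sometimes be too short. One alternative is to poll until a condition holds. The helper below is a generic standard-library sketch, not part of Selenium (Selenium ships its own WebDriverWait for this purpose), and the name wait_until and its default values are assumptions.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Possible use inside the chapter loop (driver and etree come from the script):
# imgs = wait_until(
#     lambda: etree.HTML(driver.page_source).xpath('//*[@id="images"]/img')
# )
```

An empty list is falsy, so the lambda keeps polling until at least one image element shows up in the page source.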
After that, the script scrapes and saves every chapter in turn.
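The saving step can be pulled into a small helper that also creates the output folder when it is missing. This is a sketch under assumptions: the name save_image and the "downloads" folder in the usage comment are made up, while the chapter-page.jpg naming follows the script.

```python
from pathlib import Path


def save_image(data, out_dir, chapter_no, page_no):
    """Write image bytes to <out_dir>/<chapter_no>-<page_no>.jpg and return the path."""
    out_dir = Path(out_dir)
    # Create the folder (and any parents) if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{chapter_no}-{page_no}.jpg"
    target.write_bytes(data)
    return target


# Possible use inside the image loop:
# save_image(Img.content, "downloads", y, x)
```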