Scraping a novel from Zongheng (纵横网) with Python 3 and writing it to a text file

程序员文章站 2022-04-27 16:54:02


Libraries used in this article:
requests
BeautifulSoup

Some methods of the requests library:

Scraping a web page involves the following key steps:

For a GET request, use requests.get to request the page:

response = requests.get(book_url, headers=header)

soup = BeautifulSoup(response.text, 'lxml')  # parse the page with BeautifulSoup; the result is the complete HTML document

content = soup.select('#readerFt > div > div.content > p')  # use soup.select to locate the body text by tag
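The request step above can be wrapped in a small helper. The following is a minimal sketch, not the article's own code: the name fetch_html and its get parameter are hypothetical, and the get parameter exists only so the function can be tried with a stand-in instead of a real network call. The User-Agent string is the one used in the full script.

```python
import requests

# Browser-like header, as in the full script; some sites refuse requests without one.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page text, or raise requests.HTTPError on a 4xx/5xx status.

    `get` defaults to requests.get; passing a stand-in is an assumption
    made here purely for offline experimentation.
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    return response.text


# Typical use (performs a real request, so it is left commented out):
# html = fetch_html("http://book.zongheng.com/chapter/681832/37860473.html")
```

Raising on a bad status, rather than returning the status code, means the caller never receives something that is not HTML by mistake.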

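When locating elements through child tags, avoid using the full copied selector where possible.

For example, on the chapter page the body text sits in the individual <p></p> tags under the element with class=content.

E.g. the selector copied for the second <p></p> tag looks like this: #readerFt > div > div.content > p:nth-child(2). Because we are scraping the whole chapter rather than a single paragraph, drop the :nth-child(2) suffix and use #readerFt > div > div.content > p instead.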
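The difference between the two selectors can be seen on a small fragment. This is a sketch only: SAMPLE_HTML is an invented snippet that imitates the structure described above, not the real Zongheng page, and it uses the built-in html.parser instead of lxml so that it needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Invented fragment imitating the structure described above.
SAMPLE_HTML = """
<div id="readerFt">
  <div>
    <div class="content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <p>Third paragraph.</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector as copied from the browser: matches only the second <p>.
second_only = soup.select("#readerFt > div > div.content > p:nth-child(2)")

# Selector with :nth-child(2) removed: matches every <p> under div.content.
all_paragraphs = soup.select("#readerFt > div > div.content > p")

print([p.get_text() for p in second_only])     # ['Second paragraph.']
print([p.get_text() for p in all_paragraphs])  # all three paragraphs
```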


The complete code is:

# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

def get_page(book_url):
    '''
        Request the page and check the response status code.
        On status 200, parse the page with BeautifulSoup and return the soup;
        on any other status, or if the request itself fails, return None.
    '''
    try:
        # Build a header that mimics a browser; some sites restrict requests that lack one and will not return data normally
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
        response = requests.get(book_url, headers=header, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')  # parse the page with BeautifulSoup; the result is the complete HTML document
            print(type(soup))  # <class 'bs4.BeautifulSoup'>
            return soup
        print('Request failed, status code:', response.status_code)
        return None
    except RequestException:
        print('Request failed!')
        return None

def download():
    soup = get_page(book_url)
    if soup is None:  # nothing to write if the request failed
        return
    content = soup.select('#readerFt > div > div.content > p')  # use soup.select to locate the body text by tag
    # print(content)  # the result is a list
    # Set the encoding explicitly; the platform default may not be able to represent every character
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for p in content:
            # Convert each <class 'bs4.element.Tag'> to str and write one paragraph per line
            f.write(str(p) + '\n')

'''
To drop the <p></p> tags, use the function below instead: it applies a regular
expression that captures only the text inside each <p></p> tag.
'''
def download1():
    soup = get_page(book_url)
    if soup is None:
        return
    content_html = soup.select('#readerFt > div > div.content')
    # print(content_html)
    content = re.findall(r'<p>(.*?)</p>', str(content_html), re.S)  # capture the text inside each <p></p> tag
    # print(content)
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for n in content:
            f.write(str(n) + '\n')

if __name__ == '__main__':
    book_url = 'http://book.zongheng.com/chapter/681832/37860473.html'
    download()
    # download1()

Calling download() writes each paragraph to the txt file on its own line, with the <p></p> tags still attached.


Calling download1() writes only the text of each paragraph to the txt file, without the <p></p> tags.
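One caveat with the regular expression is that it only matches a bare <p> tag, and it keeps any markup nested inside the paragraph. As a possible alternative (not the approach used in the script above), BeautifulSoup's get_text() can return the text directly. The sketch below compares the two on an invented fragment; SAMPLE_HTML and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Invented fragment: the first <p> carries an attribute, the second has nested markup.
SAMPLE_HTML = (
    '<div class="content">'
    '<p class="first">One</p>'
    '<p>Two <b>bold</b></p>'
    '</div>'
)

# Regex approach from download1(): only a literal "<p>" opening tag matches.
regex_result = re.findall(r'<p>(.*?)</p>', SAMPLE_HTML, re.S)

# get_text() approach: works per element and drops nested tags.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
text_result = [p.get_text().strip() for p in soup.select("div.content > p")]

print(regex_result)  # ['Two <b>bold</b>']
print(text_result)   # ['One', 'Two bold']
```

On pages where every paragraph is a plain <p>…</p> with no attributes or nested tags, both approaches give the same text.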


That completes a simple novel-scraping script. Hooray!

Original article: https://blog.csdn.net/dhr201499/article/details/107317802
