Python Web Scraping Practice

程序员文章站 2022-07-08 15:50:45

Python Web Scraper Example

Scraping a web novel and saving it to a txt file

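A quick outline of the scraping approach:
1. Pick the novel site you want to scrape and view the page source.
2. Study the structure of the page source and identify which tags hold the chapter title and the body text.
3. Import the required libraries.
4. Build the table-of-contents and chapter page URLs and send the requests.
5. Parse the chapter body and strip the redundant markup.
6. Write the scraped chapter text to a txt file.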
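
Steps 2 and 5 can be tried offline before any request is sent. The sketch below parses a made-up HTML fragment. The markup, the selectors, and the name `extract_chapter` are assumptions for illustration and do not describe the real page structure of any site.

```python
from bs4 import BeautifulSoup

# Made-up chapter page, used only for illustration
SAMPLE_HTML = """
<html><body>
<div class="bookname"><h1>Chapter 1</h1></div>
<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def extract_chapter(html):
    """Return (title, body) from a chapter page, or empty strings if a tag is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    # select_one() returns the first element matching the CSS selector, or None
    title_tag = soup.select_one('div.bookname > h1')
    body_tag = soup.select_one('#content')
    # get_text() drops the tags; the separator keeps one line per paragraph
    title = title_tag.get_text(strip=True) if title_tag else ''
    body = body_tag.get_text('\n', strip=True) if body_tag else ''
    return title, body

title, body = extract_chapter(SAMPLE_HTML)
print(title)
print(body)
```

Once the selectors return the expected text on a saved copy of the page, the same function can be pointed at live responses.
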
The complete code is as follows:

# Import the libraries
import requests
from bs4 import BeautifulSoup
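# Build the chapter page URL and send the request
def get_one_page(num):
    # Table-of-contents URL of the novel 《诡秘之主》 on 读书网 (dusuu.com)
    content_url = 'https://www.dusuu.com/ml/505/'
    # Chapter page URL
    url = content_url + str(num) + '.html'
    try:
        # Send a browser-like User-Agent header
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.55'
        }
        # Fetch the page with requests.get(); the timeout keeps a stalled connection from hanging the script
        res = requests.get(url, headers=headers, timeout=10)
        # Set the response encoding so the Chinese text decodes correctly
        res.encoding = 'utf8'
        # Return the page text only if the request succeeded
        if res.status_code == 200:
            return res.text
        return None
    except requests.RequestException:
        return None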
# Parse a chapter page
def parse_one_page(html):
    # Parse with the HTML parser from the Python standard library
    soup = BeautifulSoup(html, 'html.parser')
    # Locate the chapter title and body with CSS selectors; select_one() returns the first match or None
    title_tag = soup.select_one('body > div.content_read > div > div.bookname > h1')
    body_tag = soup.select_one('#content')
    # get_text() drops the surrounding tags, so no chain of replace() calls is needed
    title = title_tag.get_text(strip=True) if title_tag else ''
    # Keep one line per paragraph in the body
    body = body_tag.get_text('\n', strip=True) if body_tag else ''
    chapter = [title, body]
    return chapter
# Write to the txt file
def write_to_file(chapter):
    # The with statement closes the file automatically when the block ends
    with open('novel.txt', 'a', encoding='utf-8') as file:
        file.write('\n' + '=' * 50 + '\n')
        file.write('\n'.join(chapter))
        file.write('\n' + '=' * 50 + '\n')
     
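def main():
    # Crawl page by page: judging from the URLs, 3533477 is the ID of the first chapter and each later chapter adds 1
    begin = 3533477
    for num in range(begin, begin + 100):
        html = get_one_page(num)
        # get_one_page() returns None when the request fails, so skip that chapter
        if html is None:
            continue
        chapter = parse_one_page(html)
        write_to_file(chapter)

main()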
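
Setting the response encoding in get_one_page() matters because the same bytes read very differently under the wrong codec, and requests may fall back to ISO-8859-1 when the Content-Type header names no charset. A minimal standard-library sketch of the effect, using a made-up sample string:

```python
# Sample text, chosen only for illustration
text = '第一章'
raw = text.encode('utf-8')          # bytes as a UTF-8 server would send them

wrong = raw.decode('iso-8859-1')    # decoding with the wrong codec yields mojibake
right = raw.decode('utf-8')         # decoding with the matching codec restores the text

print(len(raw))        # 9: each of these characters takes 3 bytes in UTF-8
print(right == text)   # True
print(wrong == text)   # False
```
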

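The loop in main() sends its 100 requests back to back. A gentler variant is sketched here under stated assumptions: `crawl`, `fetch`, `handle`, and `delay` are hypothetical names that do not appear in the original code, and the fetch function is passed in as an argument so the loop can be exercised without network access.

```python
import time

def crawl(begin, count, fetch, handle, delay=1.0):
    """Fetch `count` consecutive chapter IDs, pausing between requests.

    fetch(num) returns page text or None; handle(text) processes one page.
    Returns the list of chapter IDs that could not be fetched.
    """
    failed = []
    for num in range(begin, begin + count):
        html = fetch(num)
        if html is None:
            failed.append(num)      # remember the gap instead of crashing
        else:
            handle(html)
        time.sleep(delay)           # pause so the site is not flooded
    return failed

# Usage with a stand-in fetcher that fails on one chapter
pages = []
fake_fetch = lambda num: None if num == 3 else f'page {num}'
missing = crawl(1, 4, fake_fetch, pages.append, delay=0)
print(pages)
print(missing)
```

With the functions from the article, this could be called as crawl(3533477, 100, get_one_page, lambda html: write_to_file(parse_one_page(html))).
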
The resulting txt file looks like this:
[Image: contents of novel.txt]

Original article: https://blog.csdn.net/qq_47155092/article/details/110928621
