A Simple Web Scraper Project (Scraping a Novel)

程序员文章站 2022-01-01 12:24:38


Contents
1. Tools
2. Code Walkthrough
3. Complete Code
4. Results

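1. Tools
1.1 Libraries we need
requests  # third-party HTTP library, the basic building block of this scraper
re  # regular expressions (part of the Python standard library)
1.2 Installation
Only requests has to be installed. re ships with Python, and running pip install re fails.

pip install requests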

1.3 Importing the libraries

import requests
import re

2. Code Walkthrough
First, set the URL of the page we want to request (here, a novel picked at random from Biquge, 笔趣阁).

url = 'http://www.biquge.info/10_10582/'

Send a GET request and store the response object in a variable.

response = requests.get(url)

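The scraped text may come out garbled, so we set the encoding here. (This line is optional: add it only if the text turns out garbled when you read it below, or set the encoding to utf-8 directly if the page is known to be UTF-8.)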

response.encoding = response.apparent_encoding
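
To see why the garbling happens, here is a minimal sketch that uses only the standard library. It assumes the page is served as GBK bytes, which is an assumption for illustration (the site's real encoding is not stated above), and raw_bytes is a made-up sample. When the HTTP header declares no charset, requests falls back to ISO-8859-1 for text responses, so the text gets decoded with the wrong codec.

```python
# Hypothetical sample: a chapter title as a GBK-encoded server might send it.
raw_bytes = '第一章'.encode('gbk')

# Decoding with the wrong codec "succeeds" but produces garbled characters.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the codec the page was actually written in restores the text.
fixed = raw_bytes.decode('gbk')

print(repr(garbled))
print(fixed)
```

Assigning response.apparent_encoding to response.encoding asks requests to guess the codec from the bytes themselves, which plays the role of the second decode call in this sketch.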

Store the text content of the response in a variable.

html_data = response.text

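Use a regular expression to extract the relative URL of each chapter and collect the URLs in a list. The non-greedy group (.*?) is a general-purpose way to capture whatever sits between two known pieces of text. The chapter titles are not captured here; they are read from each chapter page in the next step.

result_list = re.findall(r'<dd><a href="(.*?)" title=".*?">.*?</a></dd>', html_data)
top_10 = result_list[1:11]  # skip the first match and keep the next ten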
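
The difference between greedy and non-greedy matching can be sketched on a small sample. The markup in sample_html is made up to resemble the chapter list that the pattern expects, and the variable names are only for illustration.

```python
import re

# Hypothetical markup: two chapter entries on a single line.
sample_html = ('<dd><a href="1.html" title="Chapter 1">Chapter 1</a></dd>'
               '<dd><a href="2.html" title="Chapter 2">Chapter 2</a></dd>')

# Greedy .* runs to the last possible match, so both entries collapse into one.
greedy = re.findall(r'<dd><a href="(.*?)" title=".*">.*</a></dd>', sample_html)

# Lazy .*? stops at the first possible match, so each entry is found separately.
lazy = re.findall(r'<dd><a href="(.*?)" title=".*?">.*?</a></dd>', sample_html)

# With two groups, findall returns (href, title) tuples instead of strings.
pairs = re.findall(r'<dd><a href="(.*?)" title="(.*?)">.*?</a></dd>', sample_html)

print(greedy)
print(lazy)
print(pairs)
```

Here greedy holds a single href while lazy holds both. When every entry sits on its own line the two patterns behave the same, because . does not match a newline unless re.S is passed.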

Loop over the list. For each chapter, fetch the page, extract the title and the body text, and print them.

for top in top_10:
    # Build the full chapter URL from the index URL and the relative link
    all_url = 'http://www.biquge.info/10_10582/' + top

    response_2 = requests.get(all_url)
    response_2.encoding = response_2.apparent_encoding
    html_data_2 = response_2.text
    # re.S lets . match newlines, since the chapter body spans many lines
    title = re.findall('<h1>(.*?)</h1>', html_data_2, re.S)[0]
    contend = re.findall('<div id="content"><!--go-->(.*?)</div>', html_data_2, re.S)[0]
    print(title, contend)
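
String concatenation works as long as every href is a bare file name. As a hedged alternative, urljoin from the standard library also copes with root-relative and absolute links. The href values below are hypothetical, since the real ones are not shown above.

```python
from urllib.parse import urljoin

base_url = 'http://www.biquge.info/10_10582/'

# A bare file name is resolved against the index page URL.
relative = urljoin(base_url, '5001515.html')

# A root-relative href replaces the path instead of being appended to it.
root_relative = urljoin(base_url, '/10_10582/5001515.html')

# An absolute href is returned unchanged.
absolute = urljoin(base_url, 'http://www.biquge.info/10_10582/5001515.html')

print(relative)
print(root_relative)
print(absolute)
```

All three calls yield the same chapter URL, whereas plain concatenation would only be correct for the first form.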

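Finally, still inside the loop, save each chapter to a text file named after its title, in a folder called 三寸人间 under the current directory. open() does not create the folder, so create it first (by hand, or with os.makedirs before the loop).

    with open('三寸人间/' + title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(contend.replace('&nbsp;', '').replace('<br/>', '\n'))
        print('Downloading:', title)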
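
The saving step can be made a little more robust. The sketch below is one possible way, not the original author's method: clean_chapter, safe_filename, and save_chapter are hypothetical helper names, and the sample title and body are made up. It creates the folder if needed, replaces characters that Windows forbids in file names, and decodes HTML entities with html.unescape.

```python
import html
import os
import re
import tempfile


def clean_chapter(raw_html):
    """Turn the captured chapter HTML into plain text."""
    text = re.sub(r'<br\s*/?>', '\n', raw_html)  # <br>, <br/>, <br /> become newlines
    text = html.unescape(text)                   # &nbsp; becomes U+00A0, &amp; becomes &
    return text.replace('\xa0', '').strip()      # drop the non-breaking spaces


def safe_filename(title):
    """Replace characters that Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def save_chapter(folder, title, raw_html):
    os.makedirs(folder, exist_ok=True)           # create the folder on first use
    path = os.path.join(folder, safe_filename(title) + '.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(clean_chapter(raw_html))
    return path


# Usage with made-up data, written to a temporary directory.
demo_dir = os.path.join(tempfile.mkdtemp(), 'novel')
saved = save_chapter(demo_dir, 'Chapter 1: Who?', '&nbsp;&nbsp;Line one<br/>Line two')
print(saved)
```

In the scraper loop, a call such as save_chapter('三寸人间', title, contend) would take the place of the with block.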

3. Complete Code

import os
import re

import requests

url = 'http://www.biquge.info/10_10582/'

response = requests.get(url)
response.encoding = response.apparent_encoding
html_data = response.text

# Capture the relative URL of every chapter link
result_list = re.findall(r'<dd><a href="(.*?)" title=".*?">.*?</a></dd>', html_data)
top_10 = result_list[1:11]  # skip the first match and keep the next ten

# open() does not create folders, so make sure the output folder exists
os.makedirs('三寸人间', exist_ok=True)

for top in top_10:
    all_url = 'http://www.biquge.info/10_10582/' + top

    response_2 = requests.get(all_url)
    response_2.encoding = response_2.apparent_encoding
    html_data_2 = response_2.text

    # re.S lets . match newlines, since the chapter body spans many lines
    title = re.findall('<h1>(.*?)</h1>', html_data_2, re.S)[0]
    contend = re.findall('<div id="content"><!--go-->(.*?)</div>', html_data_2, re.S)[0]
    print(title, contend)

    with open(os.path.join('三寸人间', title + '.txt'), mode='w', encoding='utf-8') as f:
        f.write(contend.replace('&nbsp;', '').replace('<br/>', '\n'))
        print('Downloading:', title)

4. Results
[Figure: A simple web scraper project (scraping a novel)]

Original article: https://blog.csdn.net/qq_43470809/article/details/111053617
