Scrapy in Practice: Crawling the Zongheng (纵横网) Novel Site

Programmer Article Station (程序员文章站), 2022-06-01 12:07:23


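Preface

With some free time on my hands I wanted to practice coding. I wasn't sure which site to crawl, so I picked Zongheng (纵横网) and my favorite novel, 雪中悍刀行, as the exercise.

Prerequisites

Python 3
Scrapy

Creating the project:

In a cmd window, switch to your working directory and create the Scrapy project with two commands, scrapy startproject and scrapy genspider, then open the project in PyCharm.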

d:\pythonwork>scrapy startproject zongheng
New Scrapy project 'zongheng', using template directory 'c:\users\11573\appdata\local\programs\python\python36\lib\site-packages\scrapy\templates\project', created in:
    d:\pythonwork\zongheng

You can start your first spider with:
    cd zongheng
    scrapy genspider example example.com

d:\pythonwork>cd zongheng

d:\pythonwork\zongheng>cd zongheng

d:\pythonwork\zongheng\zongheng>scrapy genspider xuezhong http://book.zongheng.com/chapter/189169/3431546.html
Created spider 'xuezhong' using template 'basic' in module:
  zongheng.spiders.xuezhong

Deciding what to scrape

First, open the page and look at what we need to scrape.

A novel has a simple structure with only three parts: volume, chapter, and content.

So items.py looks like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZonghengItem(scrapy.Item):
    # One item per chapter page
    book = scrapy.Field()     # volume name
    section = scrapy.Field()  # chapter title
    content = scrapy.Field()  # chapter text

Writing the spider to extract content

As before, we first create a main.py file to make testing the code easier:

from scrapy import cmdline
cmdline.execute('scrapy crawl xuezhong'.split())

Then we can start with a minimal version of the spider file:

# -*- coding: utf-8 -*-
import scrapy


class XuezhongSpider(scrapy.Spider):
    name = 'xuezhong'
    allowed_domains = ['http://book.zongheng.com/chapter/189169/3431546.html']
    start_urls = ['http://book.zongheng.com/chapter/189169/3431546.html/']

    def parse(self, response):
        # Print the raw HTML to confirm the page can be fetched
        print(response.text)

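Run main.py and check whether anything is printed.

The whole page comes back, which suggests the site has essentially no anti-scraping measures; we don't even need to change the User-Agent. So let's get started.

Open the page, press F12 to locate the elements, work out their XPath expressions, and then write the spider.

Note that the chapter content needs some cleaning, because it contains HTML tags that have to be removed.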

# -*- coding: utf-8 -*-
import scrapy
import re
from zongheng.items import ZonghengItem


class XuezhongSpider(scrapy.Spider):
    name = 'xuezhong'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/chapter/189169/3431546.html/']

    def parse(self, response):
        xuezhong_item = ZonghengItem()
        xuezhong_item['book'] = response.xpath('//*[@id="reader_warp"]/div[2]/text()[4]').get()[3:]
        xuezhong_item['section'] = response.xpath('//*[@id="readerft"]/div/div[2]/div[2]/text()').get()

        content = response.xpath('//*[@id="readerft"]/div/div[5]').get()
        # The raw content still contains <p></p> and <div> tags, so strip them
        content = re.sub(r'</p>', "", content)
        # [^>]* keeps the <div ...> match from running past the tag's own '>'
        content = re.sub(r'<p>|<div[^>]*>|</div>', "\n", content)

        xuezhong_item['content'] = content
        yield xuezhong_item

        # Follow the "next chapter" link, if there is one
        nextlink = response.xpath('//*[@id="readerft"]/div/div[7]/a[3]/@href').get()
        print(nextlink)
        if nextlink:
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            yield scrapy.Request(response.urljoin(nextlink), callback=self.parse)
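
The tag cleanup done in parse can also be tried on its own, without running the spider. The following is only a minimal sketch, assuming the chapter body is a <div> wrapping <p> paragraphs; the function name clean_content and the sample HTML are hypothetical and not taken from the Zongheng page.

```python
import re


def clean_content(html):
    """Turn a chapter <div> with <p> paragraphs into plain text lines."""
    # Drop closing </p> tags entirely.
    text = re.sub(r'</p>', '', html)
    # Replace <p>, opening <div ...> and closing </div> tags with newlines.
    # [^>]* stops at the first '>', so one match never swallows several tags.
    text = re.sub(r'<p>|<div[^>]*>|</div>', '\n', text)
    # Remove the blank lines left at the very start and end.
    return text.strip()


# Hypothetical sample, only for illustration.
sample = '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
print(clean_content(sample))
```

Keeping the cleanup in a small function like this makes it easy to check the regular expressions against a saved copy of a page before starting a long crawl.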

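If the spider sometimes cannot move on to the next link, the request was probably filtered out by allowed_domains. The list should contain bare domain names such as book.zongheng.com rather than a full URL; fixing that solves it.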
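
To see why a full URL in allowed_domains blocks follow-up requests, here is a simplified sketch of a host-based check. It is an approximation for illustration only, not Scrapy's actual implementation, and the function name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed entry."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


url = 'http://book.zongheng.com/chapter/189169/3431546.html'

# A full URL is never equal to a hostname, so nothing matches.
print(is_allowed(url, ['http://book.zongheng.com/chapter/189169/3431546.html']))  # False
# A bare domain matches every page on that host.
print(is_allowed(url, ['book.zongheng.com']))  # True
```

The comparison is made against the host part of each request, which is why only domain names belong in the list.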

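Alas, I then discovered that after the first hundred-odd chapters of volume one the book requires VIP access. You could crawl a free copy from another site, but this time we'll settle for those hundred-odd chapters.

Saving the content

Next comes saving the content. This time we simply save it to local txt files.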

First, enable ITEM_PIPELINES in settings.py.
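
In a project generated by scrapy startproject, this setting is normally already present in settings.py but commented out. Below is a sketch of the enabled entry, assuming the pipeline class in pipelines.py is named ZonghengPipeline; adjust the dotted path if your class is named differently. The value 300 is the order number used by the project template, and lower numbers run earlier.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # 'module.path.ClassName': order (0-1000, lower runs first)
    'zongheng.pipelines.ZonghengPipeline': 300,
}
```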

Then write pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ZonghengPipeline(object):
    def process_item(self, item, spider):
        # One file per chapter; the ../xuezhongtxt/ directory must already exist
        filename = item['book'] + item['section'] + '.txt'
        # Explicit UTF-8 so Chinese text is written the same way on every platform
        with open("../xuezhongtxt/" + filename, 'w', encoding='utf-8') as txtf:
            txtf.write(item['content'])
        return item
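
Chapter titles can contain characters that are not valid in Windows file names, and the target directory has to exist before open is called. The following is a minimal sketch of one way to handle both with the standard library only; the helper names safe_filename and save_chapter and the sample values are hypothetical.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def save_chapter(directory, book, section, content):
    """Write one chapter to <directory>/<book><section>.txt as UTF-8."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, safe_filename(book + section) + '.txt')
    with open(path, 'w', encoding='utf-8') as txtf:
        txtf.write(content)
    return path


# Demo in a temporary directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chapter(tmp, 'Volume 1 ', 'Chapter 1: Who?', 'chapter text')
    print(os.path.basename(path))
```

In the pipeline, process_item could call a helper like save_chapter with the item's book, section and content fields instead of building the path by hand.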

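Because of a poor choice of site, we can only crawl the hundred-odd free chapters, which is a little embarrassing. Still, the same approach carries over to other sites where the whole book is free.

So, isn't crawling with Scrapy convenient?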
