Scraping the cnblogs (博客园) Picks Section with Scrapy

程序员文章站 (Programmer Article Station), 2022-05-18 22:44:53


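Crawl target

Collect the title, title link, author, link to the author's blog home page, summary, publication time, comment count, view count, and recommendation count of every article in the cnblogs picks section, and store them in MongoDB.

Environment

Scrapy is installed
MongoDB is installed

Create the project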

scrapy startproject cnblogs

Running the command above at the command prompt creates a folder named cnblogs.

Create the spider file

cd cnblogs
scrapy genspider cn cnblogs.com

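Running the commands above creates a spider file named cn.py under cnblogs\spiders\; cnblogs.com is the domain the spider is allowed to crawl.

Write the items.py file

Define the fields to be scraped.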

import scrapy


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    post_author = scrapy.Field()    # author of the post
    author_link = scrapy.Field()    # link to the author's blog home page
    post_date = scrapy.Field()      # publication time
    digg_num = scrapy.Field()       # recommendation count
    title = scrapy.Field()          # title
    title_link = scrapy.Field()     # title link
    item_summary = scrapy.Field()   # summary
    comment_num = scrapy.Field()    # comment count
    view_num = scrapy.Field()       # view count

Write the spider file cn.py

import scrapy
from cnblogs.items import CnblogsItem


class CnSpider(scrapy.Spider):
    name = 'cn'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/pick/']

    def parse(self, response):
        # Each child div of #post_list holds one article entry.
        div_list = response.xpath("//div[@id='post_list']/div")
        for div in div_list:
            item = CnblogsItem()
            item["post_author"] = div.xpath(".//div[@class='post_item_foot']/a/text()").extract_first()
            item["author_link"] = div.xpath(".//div[@class='post_item_foot']/a/@href").extract_first()
            # extract() returns every text node; the pipeline picks the useful one.
            item["post_date"] = div.xpath(".//div[@class='post_item_foot']/text()").extract()
            item["comment_num"] = div.xpath(".//span[@class='article_comment']/a/text()").extract_first()
            item["view_num"] = div.xpath(".//span[@class='article_view']/a/text()").extract_first()
            item["title"] = div.xpath(".//h3/a/text()").extract_first()
            item["title_link"] = div.xpath(".//h3/a/@href").extract_first()
            item["item_summary"] = div.xpath(".//p[@class='post_item_summary']/text()").extract()
            item["digg_num"] = div.xpath(".//span[@class='diggnum']/text()").extract_first()
            yield item

        # Follow the next-page link until it no longer appears.
        # The match is case-sensitive; the link text is assumed to be 'Next >'.
        next_url = response.xpath(".//a[text()='Next >']/@href").extract_first()
        if next_url is not None:
            next_url = "https://www.cnblogs.com" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )
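The pagination step turns the href of the next-page link into an absolute URL before requesting it. As a minimal sketch of that step on its own, the snippet below uses the standard library's urllib.parse.urljoin rather than Scrapy; the helper name build_next_url and the href values are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin

# Page URL taken from start_urls; the hrefs passed in below are made-up examples.
PAGE_URL = "https://www.cnblogs.com/pick/"


def build_next_url(page_url, href):
    """Return an absolute URL for the next page, or None when there is no link."""
    if href is None:
        return None
    # urljoin resolves root-relative, page-relative and absolute hrefs alike.
    return urljoin(page_url, href)


if __name__ == "__main__":
    print(build_next_url(PAGE_URL, "/pick/2/"))
    print(build_next_url(PAGE_URL, "https://www.cnblogs.com/pick/3/"))
    print(build_next_url(PAGE_URL, None))
```

Plain string concatenation with the site root works as long as the href is root-relative; joining against the page URL also copes with hrefs that are already absolute.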

Write the pipelines.py file

Lightly clean the scraped data by removing meaningless whitespace strings, then save it to MongoDB.

import re

from pymongo import MongoClient

# Connect to the local MongoDB instance on the default host and port.
client = MongoClient()
collection = client["test"]["cnblogs"]


class CnblogsPipeline(object):
    def process_item(self, item, spider):
        item["post_date"] = self.process_string_list(item["post_date"])
        item["comment_num"] = self.process_string(item["comment_num"])
        item["item_summary"] = self.process_string_list(item["item_summary"])
        print(item)
        # insert_one replaces Collection.insert, which was removed in PyMongo 4.
        collection.insert_one(dict(item))
        return item

    def process_string(self, content_string):
        # Strip every whitespace character from a single string.
        if content_string is not None:
            content_string = re.sub(r"\s", "", content_string)
        return content_string

    def process_string_list(self, string_list):
        # Strip whitespace from each string and keep the first non-empty one.
        if string_list is not None:
            cleaned = [re.sub(r"\s", "", i) for i in string_list]
            cleaned = [i for i in cleaned if len(i) > 0]
            string_list = cleaned[0] if cleaned else None
        return string_list
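The cleaning logic can be tried out without Scrapy or MongoDB. The sketch below reimplements the two helpers as plain functions; the names strip_whitespace and first_non_empty and the sample strings are hypothetical, chosen to resemble the list of text nodes that extract() might return when an element's text is split by a child tag.

```python
import re


def strip_whitespace(text):
    """Remove every whitespace character from text; pass None through."""
    if text is None:
        return None
    # For str patterns, \s also matches Unicode whitespace such as \xa0.
    return re.sub(r"\s", "", text)


def first_non_empty(strings):
    """Return the first string that is non-empty after stripping, else None."""
    for raw in strings or []:
        cleaned = strip_whitespace(raw)
        if cleaned:
            return cleaned
    return None


if __name__ == "__main__":
    # Made-up text nodes: a whitespace-only node followed by the useful one.
    sample = ["\r\n    ", "\r\n posted 2018-11-20 10:00 \r\n"]
    print(first_non_empty(sample))
    print(first_non_empty(["  ", "\n"]))
```

Note that stripping all whitespace also removes the spaces between words, so this approach suits counts and dates better than free text that contains spaces.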

Modify the settings.py file

Add USER_AGENT

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'

Enable the pipeline

ITEM_PIPELINES = {
   'cnblogs.pipelines.CnblogsPipeline': 300,
}
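Since settings.py is being edited anyway, the MongoDB connection details could also live there instead of being hard-coded in pipelines.py. The following is only a sketch of that variation, not the article's pipeline: it relies on Scrapy's from_crawler, open_spider and close_spider pipeline hooks, and the class name MongoSettingsPipeline as well as the setting names MONGO_URI, MONGO_DATABASE and MONGO_COLLECTION are assumptions introduced here.

```python
class MongoSettingsPipeline:
    """Sketch of a pipeline that reads its MongoDB connection from settings."""

    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names; the defaults mirror a local MongoDB
        # with database "test" and collection "cnblogs".
        settings = crawler.settings
        return cls(
            mongo_uri=settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=settings.get("MONGO_DATABASE", "test"),
            collection_name=settings.get("MONGO_COLLECTION", "cnblogs"),
        )

    def open_spider(self, spider):
        # Imported lazily so the class can be loaded without pymongo installed.
        from pymongo import MongoClient

        self.client = MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        # Release the connection once the crawl has finished.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

Such a class would be registered in ITEM_PIPELINES under its own dotted path, and opening the connection in open_spider rather than at import time keeps the client tied to the lifetime of the crawl.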

Run the program

Run the following command to start the crawl.

scrapy crawl cn

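Results

After the program finishes, the data in MongoDB looks as shown in the figure below; the visualization tool used is Robo 3T.
[Figure: the stored data in MongoDB, viewed in Robo 3T]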

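Thank you for reading. If anything in this article is incorrect, please point it out; I will gladly learn from it and correct it.
Thanks again for patiently reading to the end.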
