A simple Scrapy crawler example
1. Create a folder
Open a cmd prompt and switch to the D: drive:
#cmd
D:
Create the article folder:
mkdir article
2. Create the project
Still in cmd, switch into the new folder and create the Scrapy project (this matches the D:\article\article path used to run the crawl later):
cd article
scrapy startproject article
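startproject generates a standard scaffold; the result should look roughly like this:
article/                  <- project root, contains scrapy.cfg
    scrapy.cfg
    article/              <- the project's Python package
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/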
3. Generate the spider
From inside the project root (the directory that contains scrapy.cfg), generate the spider:
scrapy genspider xinwen www.hbskzy.cn
# the command takes the spider name followed by the domain
# the spider name must not be the same as the project name
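genspider only writes a skeleton; before editing, spiders/xinwen.py should look roughly like this (the exact template varies slightly by Scrapy version):
import scrapy

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        pass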
4. Write items, spider, pipelines, and settings in turn
items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
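An Item behaves like a dict restricted to its declared fields, so a typo in a field name fails loudly instead of silently creating a new key. A quick illustrative sketch (the values here are made up):
item = ArticleItem()
item['title'] = '校园新闻'                          # hypothetical title
item['link'] = 'http://www.hbskzy.cn/info/1.htm'    # hypothetical URL
# item['author'] = '...' would raise KeyError: author is not a declared field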
spiders/xinwen.py:
import scrapy
from article.items import ArticleItem

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021']

    def parse(self, response):
        # one item per <a> node, so each title stays with its own link
        for a in response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a'):
            item = ArticleItem()
            item['title'] = a.xpath('./text()').extract_first()
            # urljoin turns relative hrefs into absolute URLs
            item['link'] = response.urljoin(a.xpath('./@href').extract_first())
            yield item
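Iterating over whole <a> elements, rather than collecting all hrefs and all texts in two separate passes, keeps each article's title and link paired inside a single item; response.urljoin resolves relative hrefs (e.g. info/...) against the page URL, while absolute links pass through unchanged.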
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv
class ArticlePipeline:
    def process_item(self, item, spider):
        # newline='' prevents blank rows in the CSV on Windows
        with open('生科新闻.csv', 'a', encoding='utf-8', newline='') as file:
            csv.writer(file).writerow([item['title'], item['link']])
        return item  # a pipeline must return the item (or raise DropItem)
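As an aside, for a flat dump like this Scrapy's built-in feed exports can do the same job with no custom pipeline at all; running the crawl as, for example,
scrapy crawl xinwen -o 生科新闻.csv
writes every yielded item to CSV automatically.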
settings.py:
# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'article'
SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'
ITEM_PIPELINES = {'article.pipelines.ArticlePipeline': 100}  # this line must be added to enable the pipeline
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
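Two optional additions are often worth considering here; both are a matter of preference, not required by this project:
# only affects the built-in feed exports (-o); utf-8-sig opens cleanly in Excel
FEED_EXPORT_ENCODING = 'utf-8-sig'
# crawl politely: pause between requests
DOWNLOAD_DELAY = 1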
5. Run the spider
After everything is written, check each file carefully for mistakes.
Then start the spider from the cmd prompt:
cd /D D:/article/article
scrapy crawl xinwen
The page-analysis process is not covered here; test and debug the XPath expressions in advance, otherwise the spider is likely to produce errors.
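scrapy shell is the usual tool for that kind of testing; a typical session to verify the XPath used above looks like:
scrapy shell "http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021"
>>> response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a/@href').extract()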
When the crawl finishes, a 生科新闻.csv file is generated in the article directory; open it to see the scraped titles and links.