
A Simple Scrapy Crawler Example


1. Create a folder

Open a cmd prompt and switch to the D: drive:

#cmd
D:

Create the article folder and switch into it:

mkdir article
cd article

2. Create the project

scrapy startproject article
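
Running this inside D:\article produces the standard Scrapy project skeleton (middlewares.py is part of the template even though this example never touches it):

article/                  <- project root, i.e. D:\article\article
    scrapy.cfg            <- config entry point; run scrapy commands from here
    article/              <- the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py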

3. Create the spider

cd article
scrapy genspider xinwen www.hbskzy.cn
# run from the project root; the arguments are the spider name and the domain
# the spider name must not be the same as the project name
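
The command drops a minimal spider skeleton into article/spiders/xinwen.py, roughly like this (the exact template text varies slightly between Scrapy versions):

import scrapy


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        pass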

4. Write the items, spider, pipelines, and settings files in turn

The items file (article/items.py):

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleItem(scrapy.Item):
    # the two fields scraped for each news entry
    title = scrapy.Field()
    link = scrapy.Field()
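
An Item behaves like a dict restricted to its declared fields, which catches typos early. A quick illustrative sketch (the values here are made up):

item = ArticleItem()
item['title'] = 'Example headline'            # hypothetical value, for illustration
item['link'] = 'http://www.hbskzy.cn/a.htm'   # hypothetical value
# item['author'] = '...'  would raise KeyError: 'author' is not a declared field
print(dict(item))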

The spider file (article/spiders/xinwen.py):

import scrapy
from article.items import ArticleItem


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021']

    def parse(self, response):
        # Each <a> node in the news list carries both the title text and the
        # href, so iterate over the nodes and yield one item per entry, keeping
        # title and link paired.
        for a in response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a'):
            item = ArticleItem()
            item['title'] = a.xpath('./text()').get()
            # urljoin resolves relative hrefs (e.g. "info/...") against the page URL
            item['link'] = response.urljoin(a.xpath('./@href').get())
            yield item
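
Before a full crawl, the XPath expressions can be checked interactively in the Scrapy shell:

scrapy shell "http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021"
>>> response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a/text()').getall()
>>> response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a/@href').getall()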

The pipelines file (article/pipelines.py):

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv

class ArticlePipeline:
    def process_item(self, item, spider):
        # newline='' stops the csv module from writing blank lines on Windows
        with open('生科新闻.csv', 'a', encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            writer.writerow([item['title'], item['link']])
        return item  # pass the item on to any later pipelines
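
As an aside, for a flat dump like this Scrapy's built-in feed export can do the same job without a custom pipeline; the -o flag appends items to a file whose format is inferred from the extension:

scrapy crawl xinwen -o 生科新闻.csv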

The settings file (article/settings.py):

# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article'

SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'
ITEM_PIPELINES = {'article.pipelines.ArticlePipeline': 100}  # required, otherwise the pipeline never runs

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
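
Two optional settings may be worth adding here (assumptions on top of the original file, not part of it): a download delay for polite crawling, and a BOM-prefixed encoding so Excel detects UTF-8 correctly; note the latter only affects the feed exporter, not the custom pipeline above.

# Optional additions (not in the generated template):
DOWNLOAD_DELAY = 1                    # wait one second between requests
FEED_EXPORT_ENCODING = 'utf-8-sig'    # BOM lets Excel auto-detect UTF-8 in feed exports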

5. Run the crawler

After writing the files, check each one carefully for mistakes.
Then start the crawler from the cmd prompt (the crawl command must run from the project root, where scrapy.cfg lives):

cd /D D:/article/article
scrapy crawl xinwen
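
Note that the pipeline opens the CSV in append mode, so repeated runs accumulate duplicate rows; delete the old file first, or, if using the feed-export route instead, pass -O (capital O, available since Scrapy 2.4) to overwrite the output file:

scrapy crawl xinwen -O 生科新闻.csv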

Page analysis is not covered here; test and debug the spider in advance (for example in the Scrapy shell shown above), otherwise errors are likely.
When the crawl finishes, a CSV file is generated in the article directory:
(screenshot: the generated CSV file in the project directory)
(screenshot: the file after opening)
