A simple Scrapy crawler example
1. Create a folder
Open a cmd prompt and switch to the D: drive:
#cmd
D:
Create the article folder:
mkdir article
2. Create the project
Still in cmd, switch into the new folder and create the Scrapy project (this matches the D:\article\article path used to run the crawl later):
cd article
scrapy startproject article
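startproject generates a standard scaffold; the result should look roughly like this:
article/                  <- project root, contains scrapy.cfg
    scrapy.cfg
    article/              <- the project's Python package
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/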
3. Generate the spider
From inside the project root (the directory that contains scrapy.cfg), generate the spider:
scrapy genspider xinwen www.hbskzy.cn
# the command takes the spider name followed by the domain
# the spider name must not be the same as the project name
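genspider only writes a skeleton; before editing, spiders/xinwen.py should look roughly like this (the exact template varies slightly by Scrapy version):
import scrapy

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        pass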
4. Write items, spider, pipelines, and settings in turn
items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
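An Item behaves like a dict restricted to its declared fields, so a typo in a field name fails loudly instead of silently creating a new key. A quick illustrative sketch (the values here are made up):
item = ArticleItem()
item['title'] = '校园新闻'                          # hypothetical title
item['link'] = 'http://www.hbskzy.cn/info/1.htm'    # hypothetical URL
# item['author'] = '...' would raise KeyError: author is not a declared field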
spiders/xinwen.py:
import scrapy
from article.items import ArticleItem

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021']

    def parse(self, response):
        # one item per <a> node, so each title stays with its own link
        for a in response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a'):
            item = ArticleItem()
            item['title'] = a.xpath('./text()').extract_first()
            # urljoin turns relative hrefs into absolute URLs
            item['link'] = response.urljoin(a.xpath('./@href').extract_first())
            yield item
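Iterating over whole <a> elements, rather than collecting all hrefs and all texts in two separate passes, keeps each article's title and link paired inside a single item; response.urljoin resolves relative hrefs (e.g. info/...) against the page URL, while absolute links pass through unchanged.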
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv
class ArticlePipeline:
    def process_item(self, item, spider):
        # newline='' prevents blank rows in the CSV on Windows
        with open('生科新闻.csv', 'a', encoding='utf-8', newline='') as file:
            csv.writer(file).writerow([item['title'], item['link']])
        return item  # a pipeline must return the item (or raise DropItem)
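As an aside, for a flat dump like this Scrapy's built-in feed exports can do the same job with no custom pipeline at all; running the crawl as, for example,
scrapy crawl xinwen -o 生科新闻.csv
writes every yielded item to CSV automatically.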
settings.py:
# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'article'
SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'
ITEM_PIPELINES = {'article.pipelines.ArticlePipeline': 100}  # this line must be added to enable the pipeline
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
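Two optional additions are often worth considering here; both are a matter of preference, not required by this project:
# only affects the built-in feed exports (-o); utf-8-sig opens cleanly in Excel
FEED_EXPORT_ENCODING = 'utf-8-sig'
# crawl politely: pause between requests
DOWNLOAD_DELAY = 1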
5. Run the spider
After everything is written, check each file carefully for mistakes.
Then start the spider from the cmd prompt:
cd /D D:/article/article
scrapy crawl xinwen
The page-analysis process is not covered here; test and debug the XPath expressions in advance, otherwise the spider is likely to produce errors.
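scrapy shell is the usual tool for that kind of testing; a typical session to verify the XPath used above looks like:
scrapy shell "http://www.hbskzy.cn/index_list.jsp?urltype=tree.TreeTempUrl&wbtreeid=1021"
>>> response.xpath('//*[@id="vsb_content_2"]/ul/li/div/a/@href').extract()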
When the crawl finishes, a 生科新闻.csv file is generated in the article directory; open it to see the scraped titles and links.