Crawling the First Ten Pages of Gushiwen (古诗文网) with Scrapy
Overview
This post uses Scrapy to crawl the first ten pages of poems from the Gushiwen site and save them to a text file.
Creating the Scrapy project
Create a new crawler project from the command line:

scrapy startproject gsww  # create a new project

Then change into the project directory and generate a spider:

cd gsww
scrapy genspider gsww_spider www.gushiwen.cn
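After these two commands, Scrapy generates a project skeleton roughly like this (exact contents can vary slightly between Scrapy versions):

gsww/
├── scrapy.cfg
└── gsww/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── gsww_spider.py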
Configuring the Scrapy project
Make the following changes in settings.py:

ROBOTSTXT_OBEY = False  # do not obey the site's robots.txt

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Language': 'en',
}  # default headers sent with every request

ITEM_PIPELINES = {
    'gsww.pipelines.GswwPipeline': 300,  # enable the pipeline defined below
}
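Optionally, the crawl can be slowed down to be gentler on the site; both of the following are standard Scrapy settings, not part of the original setup:

DOWNLOAD_DELAY = 1  # wait one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2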
Writing the spider class
The spider lives in spiders/gsww_spider.py (the pagination logic added at the end of parse() is shown in a later section):

import scrapy
from gsww.items import GswwItem


class GswwSpiderSpider(scrapy.Spider):
    name = 'gsww_spider'
    allowed_domains = ['www.gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']
    page = 1  # current page counter, used later to stop after ten pages

    def myprint(self, value):
        # debugging helper: print a value between separator lines
        print('=' * 30)
        print(value)
        print('=' * 30)

    def parse(self, response):
        gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
        for gsw_div in gsw_divs:
            # response.xpath returns a SelectorList object
            title = gsw_div.xpath('.//b/text()').getall()
            title = ''.join(title)
            dynasty = gsw_div.xpath('.//p[@class="source"]/a[1]/text()').getall()
            dynasty = ''.join(dynasty)
            author = gsw_div.xpath('.//p[@class="source"]/a[2]/text()').getall()
            author = ''.join(author)
            # //text() selects every descendant text node under class='contson'
            content_list = gsw_div.xpath(".//div[@class='contson']//text()").getall()
            content = ''.join(content_list).strip()
            self.myprint(content)  # debug output
            item = GswwItem(title=title, dynasty=dynasty, author=author, content=content)
            yield item
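If any of these XPath expressions comes back empty (the site's markup may have changed since this was written), they can be tested interactively in Scrapy's shell before editing the spider:

scrapy shell https://www.gushiwen.cn/default_1.aspx
>>> response.xpath("//div[@class='left']/div[@class='sons']")
>>> response.xpath("(//div[@class='sons'])[1]//b/text()").getall()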
Defining the item
In items.py, declare one field per piece of data to collect:

import scrapy


class GswwItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    dynasty = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
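A GswwItem behaves like a dict, which is what the pipeline below relies on. A quick sketch of how it is used (the poem values here are just sample data):

item = GswwItem(title='静夜思', dynasty='唐代', author='李白', content='床前明月光……')
print(item['title'])  # fields are read dict-style
print(dict(item))     # convert to a plain dict, e.g. for json.dumps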
Saving the data
The pipeline in pipelines.py writes each item as one JSON line to a text file:

import json


class GswwPipeline:
    def open_spider(self, spider):
        self.fp = open('古诗文.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
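As an aside, Scrapy's built-in feed exports can achieve much the same thing without a custom pipeline; this command (a standard Scrapy feature, not part of the original setup) writes items to a JSON-lines file:

scrapy crawl gsww_spider -o poems.jl

To keep the Chinese text human-readable in the exported file, also set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.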
Crawling multiple pages (in gsww_spider.py)
Append the following at the end of the parse() method; the page counter stops the crawl after the tenth page:

next_url = response.xpath('//a[@id="amore"]/@href').get()
print(next_url)
if not next_url or self.page >= 10:
    return
self.page += 1
yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)
# The URL passed to scrapy.Request must be a str, which is why get() is used
# here rather than getall(); getall() returns a list.
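Equivalently, Scrapy's response.follow() resolves relative URLs itself, so the manual string concatenation can be dropped; a minimal variant of the block above:

if next_url and self.page < 10:
    self.page += 1
    yield response.follow(next_url, callback=self.parse)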
Running the program
For convenience, you can put a small launcher script in the project root:
from scrapy import cmdline  # cmdline lets you run Scrapy commands from Python

# execute the crawl command, exactly as if typed in a terminal
cmdline.execute("scrapy crawl gsww_spider".split(' '))
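Another option is Scrapy's CrawlerProcess API, which runs the spider in-process instead of shelling out to the crawl command:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py so the headers and pipeline still apply
process = CrawlerProcess(get_project_settings())
process.crawl('gsww_spider')
process.start()  # blocks until the crawl finishes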
When run, the spider prints each poem to the console and saves all items to 古诗文.txt.
Finally, the full source code can be downloaded here:
https://download.csdn.net/download/qiaoenshi/12913580