Python crawler: the Scrapy framework
Installation: pip install Scrapy
startproject: create a new project
genspider: generate a new spider from a template
crawl: run a spider
shell: start the interactive scraping console
Create a project and enter its directory:
scrapy startproject CrawlerTest  (CrawlerTest is the project name)
cd CrawlerTest
This generates the following files:
items.py: defines the model of the fields to be scraped
settings.py: defines settings such as the user agent, download delay, and so on
spiders/: the directory that holds the actual spider code
The project configuration file scrapy.cfg and pipelines.py (which post-processes the scraped items) need no changes here.
1. Modify items.py as follows:
import scrapy

class CrawlertestItem(scrapy.Item):
    # define the fields for your item here:
    # field to scrape: name
    name = scrapy.Field()
    # field to scrape: population
    population = scrapy.Field()
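A scrapy.Item behaves like a dict whose keys are the declared fields, which is how it is used in the spider later on. A quick standalone sketch (the import path assumes the CrawlerTest project layout; the values are made up for illustration):
from CrawlerTest.items import CrawlertestItem

item = CrawlertestItem()
item['name'] = 'Example Country'    # hypothetical value
item['population'] = '1,000,000'    # hypothetical value
print(item['name'], item['population'])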
Create the spider
Use the genspider command, passing the spider name, the domain, and an optional template argument, to generate the initial template.
The command is:
scrapy genspider country CrawlerTest.webscraping.com --template=crawl
country: the spider name
The following code is generated automatically in CrawlerTest/spiders/country.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CountrySpider(CrawlSpider):
    # spider name
    name = 'country'
    # domains the spider is allowed to crawl
    allowed_domains = ['CrawlerTest.webscraping.com']
    # list of start URLs
    start_urls = ['http://CrawlerTest.webscraping.com/']
    # crawl rules (regex-based link extractors)
    rules = (
        Rule(LinkExtractor(allow='/index/'), follow=True),
        Rule(LinkExtractor(allow='/view/'), callback='parse_item'),
    )

    # extract data from the response
    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
Tuning the settings
By default, Scrapy allows up to 8 concurrent downloads for the same domain and adds no delay between them; when the download rate stays above one request per second, the crawler risks being temporarily banned. So add a per-domain request limit and a download delay in settings.py (Scrapy applies a random offset to the delay automatically, since RANDOMIZE_DOWNLOAD_DELAY is on by default):
CONCURRENT_REQUESTS_PER_DOMAIN=1
DOWNLOAD_DELAY=5
Run the spider:
scrapy crawl country -s LOG_LEVEL=DEBUG
The crawl now proceeds correctly, but a lot of time is wasted following the login and registration form links on every page. These can be excluded with the deny argument of LinkExtractor; modify country.py again as follows:
rules = (
    Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
    Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
)
Scraping with the shell command
scrapy shell url
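Inside the shell, Scrapy fetches the page and exposes a response object, so selectors can be tried out interactively before they go into the spider. For example, the two CSS selectors used in parse_item below could be checked like this (assuming a country detail page from the example site has been loaded):
>>> response.css('tr#places_country_row td.w2p_fw::text').extract()
>>> response.css('tr#places_population_row td.w2p_fw::text').extract()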
Complete spider code: country.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CrawlerTest.items import CrawlertestItem

class CountrySpider(CrawlSpider):
    name = 'country'
    allowed_domains = ['CrawlerTest.webscraping.com']
    start_urls = ['http://CrawlerTest.webscraping.com/']
    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = CrawlertestItem()
        name_css = 'tr#places_country_row td.w2p_fw::text'
        item['name'] = response.css(name_css).extract()
        pop_css = 'tr#places_population_row td.w2p_fw::text'
        item['population'] = response.css(pop_css).extract()
        return item
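Note that .extract() returns a list of all matches. If a single string per field is preferred, the two assignments above can also be written with .extract_first() (a minor variant, not part of the original code):
item['name'] = response.css(name_css).extract_first()
item['population'] = response.css(pop_css).extract_first()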
Save the results to a CSV file:
scrapy crawl country --output=countries.csv -s LOG_LEVEL=INFO
Pausing and resuming a crawl
scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=crawls/country
State directory: crawls/country
Running the same command again later resumes the crawl (this is mainly useful when crawling large websites).
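In practice the workflow is: press Ctrl-C once and wait for the spider to shut down gracefully and persist its state to crawls/country (a second Ctrl-C forces an immediate stop), then rerun the identical command to pick up where it left off:
scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=crawls/country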