Python Web Scraping: Learning the Scrapy Framework

程序员文章站 2022-05-06 20:29:28

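I. Steps:
Create a project (Project): start a new crawler project
Define the target (Items): decide which data you want to scrape
Write the spider (Spider): write the spider that crawls the pages
Store the content (Pipeline): design a pipeline that stores the scraped content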

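1. Create the project
startproject takes the project name (and optionally a directory); the spider name and its domain are passed to genspider instead:
scrapy startproject quotes
cd quotes
scrapy genspider books quotes.toscrape.com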

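2. Define the target
In Scrapy, items are containers that hold the scraped data. They behave much like a Python dict, but add some extra protection that reduces mistakes, such as rejecting fields that were never declared.
Typically an item is created by subclassing scrapy.item.Item, with each attribute defined as a scrapy.item.Field object (similar in spirit to an ORM mapping).
Next, we build the item model.
The data we want is:
Author (author)
Text (text)
Tags (tags)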

3. Write the spider, which is also the most important step

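# -*- coding: utf-8 -*-
import scrapy
import sys
# Only needed when the project directory is not already on the import path.
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import quotesItem

class BooksSpider(scrapy.Spider):
    name = 'books'  # used on the command line: scrapy crawl books
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for sel in response.xpath('//div[@class="quote"]'):
            item = quotesItem()
            # extract_first() returns a single string; extract() returns a list.
            item['text'] = sel.xpath('span[@class="text"]/text()').extract_first()
            item['author'] = sel.xpath('span/small/text()').extract_first()
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item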

4. Design the pipeline

A pipeline receives every item the spider yields and processes it.

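class DoubanPipeline(object):
    # Default pass-through pipeline: returns every item unchanged.
    def process_item(self, item, spider):
        return item

class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        # An explicit encoding avoids write errors on non-ASCII text.
        self.f = open("result.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one line of text, then pass it on.
        line = str(dict(item)) + '\n'
        self.f.write(line)
        return item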
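
A pipeline class only runs once it is registered in the project's settings.py. The fragment below is a sketch under two assumptions: that the project package is named quotes (as in the spider's import) and that both classes live in its pipelines.py. The numbers 300 and 400 are arbitrary illustrative priorities; pipelines with lower numbers run first.

```python
# settings.py (fragment)
# Maps each pipeline's import path to its priority.
# The module path "quotes.pipelines" is an assumption about the project layout.
ITEM_PIPELINES = {
    "quotes.pipelines.DoubanPipeline": 300,      # runs first
    "quotes.pipelines.DoubanInfoPipeline": 400,  # runs second, writes result.txt
}
```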

II. Notes:

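1. Using XPath selectors
response.xpath('//div/@href').extract()           # href attribute of every div
response.xpath('//div[@href]/text()').extract()   # text of every div that has an href attribute
response.xpath('//div[contains(@href, "image")]/@href').extract()   # href values that contain "image"

To select p elements under a div that are not its direct children, use
div.xpath('.//p') and note the leading dot. Without the dot, //p searches the whole document rather than just that div.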

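2. Using .re() on an XPath result
Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of strings rather than selectors, so .re() calls cannot be nested.

Here is an example that extracts the image names from the sample HTML used in the Scrapy selector documentation:

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']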
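
To see what the pattern itself captures, it can be tried with Python's standard re module on a few strings. Note that this is plain re, not Scrapy's Selector.re(), and the link texts below are made-up stand-ins for what text() might return.

```python
import re

# Hypothetical link texts, as text() might return them.
link_texts = ["Name: My image 1", "Name: My image 2", "No name here"]

# Same pattern as in the example above: skip "Name:" plus any
# whitespace, then capture the rest of the string.
pattern = re.compile(r"Name:\s*(.*)")

names = []
for text in link_texts:
    match = pattern.search(text)
    if match:  # strings without "Name:" are skipped
        names.append(match.group(1))
```

The result is an ordinary list of strings, which is why nothing further can be chained onto it with .xpath() or .css().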

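3. Regular expressions inside XPath with re:test()
When XPath's starts-with() or contains() is not enough, the test() function can be very useful.

For example, to select the links in list items whose "class" attribute ends with a digit: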

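from scrapy import Selector

doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
['link1.html', 'link2.html', 'link4.html', 'link5.html']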
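
The pattern passed to re:test() can be sanity-checked in plain Python before it goes into an XPath expression. The sketch below uses the standard re module rather than Scrapy, and the class names in the list are illustrative values.

```python
import re

# Illustrative class attribute values to test the pattern against.
li_classes = ["item-0", "item-1", "item-inactive", "item-10"]

# "item-" followed by exactly one digit at the very end of the string.
pattern = re.compile(r"item-\d$")

matching = [cls for cls in li_classes if pattern.search(cls)]
# "item-inactive" ends in a letter and "item-10" has two digits
# after the hyphen, so neither of them matches.
```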

4. Looping with enumerate() to get an index alongside each link:
for index, link in enumerate(links):
    print(index, link)
0 link1
1 link2
…
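
A self-contained version of the loop, with a made-up links list, also shows the optional start argument for counting from 1 instead of 0.

```python
# Hypothetical list of extracted links.
links = ["link1", "link2", "link3"]

# Default: indexes start at 0.
pairs = list(enumerate(links))

# start=1 gives human-friendly numbering.
numbered = [f"{index}: {link}" for index, link in enumerate(links, start=1)]
```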

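5. The four steps do not have to be followed strictly
Sometimes items.py can be left at its default,
and the spider in spider.py can yield plain dictionaries directly, for example:
yield {
    ...
}
and so on.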

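6. Following links recursively to crawl page after page

Inside the parse(self, response) method, add:

next_page = response.xpath("").extract_first()  # put the XPath of the "next page" link's href between the quotes
if next_page:
    next_page = response.urljoin(next_page)  # turn a relative href into an absolute URL
    yield scrapy.Request(next_page, callback=self.parse)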
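
The urljoin step matters because "next page" links are usually relative. The sketch below uses urllib.parse.urljoin from the standard library, which performs the same kind of resolution against a base URL; the page paths are made-up examples.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (illustrative value).
current_url = "http://quotes.toscrape.com/page/2/"

# A root-relative href replaces the whole path.
from_root = urljoin(current_url, "/page/3/")

# A bare filename is resolved against the current directory.
from_directory = urljoin(current_url, "next.html")

# An absolute URL is returned unchanged.
already_absolute = urljoin(current_url, "http://example.com/other")
```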

7. Avoiding 403 errors:
Edit the project's settings.py
and set USER_AGENT, for example:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
so that requests look like they come from a browser.
