scrapy

Dependencies

1. wheel
pip install wheel
2. lxml
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
3. PyOpenSSL
https://pypi.python.org/pypi/pyOpenSSL#downloads
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
5. Pywin32
https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
6. Scrapy
pip install scrapy

Once wheel is installed, pip can install packages distributed as .whl files, such as the ones downloaded from the links above.
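For example, a downloaded lxml wheel can be installed like this (the filename below is hypothetical; use the file matching your Python version and architecture):

pip install lxml-4.3.3-cp37-cp37m-win_amd64.whl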

PS

If you see a message like this:

You are using pip version 18.1, however version 19.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

The list above is the old way. On Python 3.7, once pip is upgraded to 19.1, all of these packages can be installed directly with pip install:

1. wheel
pip install wheel
2. lxml
pip install lxml
3. PyOpenSSL
pip install PyOpenSSL
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
This one is installed from a downloaded wheel; pick the build that matches Python 3.7.
5. Pywin32
pip install Pywin32
6. Scrapy
pip install scrapy

After installation, test it:

scrapy startproject hello
cd hello
scrapy genspider baidu http://www.baidu.com
scrapy crawl baidu
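The genspider command creates a spider skeleton at hello/spiders/baidu.py. It looks roughly like this (the exact template varies a little between Scrapy versions):

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # parsing logic goes here
        pass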

You can also use Anaconda, the scientific-computing distribution, to install Scrapy.

In my case, though, that route did not install successfully.

The Scrapy project provides an official practice site for crawler exercises:

http://quotes.toscrape.com/
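Each quote on that page is marked up roughly like this (a simplified excerpt; the class names are what the CSS selectors below target):

<div class="quote">
    <span class="text">“The world as we have created it ...”</span>
    <small class="author">Albert Einstein</small>
    <div class="tags">
        <a class="tag" href="/tag/change/">change</a>
        ...
    </div>
</div>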

scrapy startproject hello
cd hello
scrapy genspider first quotes.toscrape.com
scrapy crawl first

Edit first.py:

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        # extract_first() returns the first match as a string (or None);
        # extract() returns a list of all matches
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

Scrapy also provides an interactive shell; in a terminal, run:

scrapy shell quotes.toscrape.com
[1]: response
[2]: response.css('.quote')
[3]: quote = response.css('.quote')
[4]: quote[0]
[5]: quote[0].css('.text::text').extract_first()
......

This drops you into an interactive session for experimenting with selectors.

Output

(The commands below use a spider named quotes; substitute your own spider's name, e.g. first.)

    # output as JSON
    scrapy crawl quotes -o quotes.json
    # JSON lines, one object per line
    scrapy crawl quotes -o quotes.jl
    # save as CSV
    scrapy crawl quotes -o quotes.csv

    scrapy crawl quotes -o quotes.xml

    scrapy crawl quotes -o quotes.pickle

    scrapy crawl quotes -o quotes.marshal
    # remote FTP is also supported
    scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/quotes.csv
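The same export can also be configured in settings.py instead of on the command line. A minimal sketch using the feed-export settings of Scrapy versions from this era (newer Scrapy versions use a FEEDS dict instead); the values are illustrative:

    # settings.py
    FEED_FORMAT = 'json'      # one of: json, jsonlines, csv, xml, pickle, marshal
    FEED_URI = 'quotes.json'  # local path, or e.g. an ftp:// URI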
Fleshing out the code (HelloItem comes from hello/items.py, shown below):

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

PS:
In PyCharm, Alt+Enter is the shortcut for adding a missing import.

Adding pagination:

# -*- coding: utf-8 -*-
import scrapy

from hello.items import HelloItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # follow the "next" link until there are no more pages
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
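You can run this spider and export the scraped items in one step, for example:

scrapy crawl first -o quotes.json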

Item configuration (items.py):

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

    
class HelloItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Pipeline (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            # truncate overly long quotes to `limit` characters
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem is raised, not returned, to discard the item
            raise DropItem('Missing Text')

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB are read from settings.py (see below)
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # connect to MongoDB when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
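For the pipelines above to take effect, they must be registered in settings.py. A minimal sketch, assuming the project is named hello; the priority numbers and the MONGO_URI/MONGO_DB values are illustrative:

    # settings.py
    ITEM_PIPELINES = {
        'hello.pipelines.TextPipeline': 300,
        'hello.pipelines.MongoPipeline': 400,
    }
    MONGO_URI = 'localhost'
    MONGO_DB = 'hello'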

Reposted from: https://www.jianshu.com/p/942aaf719302