scrapy
Dependencies
1. wheel
pip install wheel
2. lxml
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
3. PyOpenSSL
https://pypi.python.org/pypi/pyOpenSSl#downloads
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
5. Pywin32
https://sourceforge.net/projects/pywin32/files/pywin32/Bulid%20220/
6. Scrapy
pip install scrapy
Once wheel is installed, you can use it to install software distributed as .whl packages.
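For example, a .whl file downloaded from the Gohlke site installs like this (the filename below is hypothetical; pick the build matching your Python version and architecture):
pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl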
PS:
When the following message appears: You are using pip version 18.1, however version 19.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
The list above is for older setups. On Python 3.7, after upgrading pip to 19.1, every library can be installed directly with pip install:
1. wheel
pip install wheel
2. lxml
pip install lxml
3. PyOpenSSL
pip install PyOpenSSL
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
This one is installed from a downloaded wheel; choose the build for Python 3.7.
5. Pywin32
pip install Pywin32
6. Scrapy
pip install scrapy
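To confirm the toolchain installed correctly before creating a project, you can check the version first:
scrapy version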
After installation is complete, test it:
scrapy startproject hello
cd hello
scrapy genspider baidu http://www.baidu.com
scrapy crawl baidu
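genspider creates hello/spiders/baidu.py with a skeleton roughly like the following (a sketch; the exact template varies by Scrapy version), and crawl runs the spider named 'baidu':

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # called with the downloaded response for each start URL
        pass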
You can also install Scrapy through Anaconda, the scientific-computing distribution, although that route did not succeed for me.
Scrapy officially provides a practice site for scraping, quotes.toscrape.com:
scrapy startproject hello
cd hello
scrapy genspider first quotes.toscrape.com
scrapy crawl first
Edit first.py:
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        # extract_first() and extract() are methods and must be called
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()
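Here ::text is Scrapy's CSS pseudo-element for selecting text nodes, extract_first() returns the first match as a string (or None if nothing matches), and extract() returns a list of all matches. A minimal sketch of the difference on this page's markup:

# given markup like: <span class="text">"A quote..."</span>
quote.css('.text').extract_first()        # full element HTML: '<span class="text">...'
quote.css('.text::text').extract_first()  # text only: '"A quote..."'
quote.css('.tags .tag::text').extract()   # all tags, as a list of strings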
Scrapy also provides an interactive shell; in a terminal, run:
scrapy shell quotes.toscrape.com
[1]: response
[2]: response.css('.quote')
[3]: quote = response.css('.quote')
[4]: quote[0]
[5]: quote[0].css('.text').extract_first()
......
so you can try out selectors interactively.
Output
// output as JSON
scrapy crawl quotes -o quotes.json
// JSON lines, one item per line
scrapy crawl quotes -o quotes.jl
// save as CSV
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
// remote FTP is also supported
scrapy crawl quotes -o ftp://user:[email protected]/path/quotes.csv
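Rather than passing -o on every run, the same feed export can be configured once in settings.py; a minimal sketch, assuming the FEED_FORMAT/FEED_URI settings of Scrapy versions from this era (later superseded by the FEEDS dict):

# settings.py -- equivalent to `scrapy crawl quotes -o quotes.json`
FEED_FORMAT = 'json'
FEED_URI = 'quotes.json'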
Fleshing out the code to yield items:
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        item = HelloItem()
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()
        item['text'] = text
        item['author'] = author
        item['tags'] = tags
        yield item
PS: in PyCharm, Alt+Enter is the quick-fix shortcut that adds a missing import.
Adding pagination:
# -*- coding: utf-8 -*-
import scrapy

from hello.items import HelloItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
        # `next` would shadow the builtin; also guard against the last page,
        # where there is no next link and extract_first() returns None
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
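The href in the pager is relative, so response.urljoin() is what turns it into an absolute URL against the current page:

# assuming the next-page link is '/page/2/'
response.urljoin('/page/2/')  # -> 'http://quotes.toscrape.com/page/2/'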
Item definition (items.py):
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HelloItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Pipelines (pipelines.py):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            # truncate over-long quotes, trimming trailing spaces first
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # the setting names below are an assumption; define them in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # the source cuts off at `def open`; the conventional continuation
        # opens the MongoDB connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
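For these pipelines to run, they have to be registered in settings.py, together with the MongoDB settings that from_crawler reads. A sketch, assuming the MONGO_URI/MONGO_DB names used above:

# settings.py
ITEM_PIPELINES = {
    'hello.pipelines.TextPipeline': 300,   # lower number runs earlier
    'hello.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'hello'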
Reposted from: https://www.jianshu.com/p/942aaf719302