Web Crawling with Scrapy

Source: 程序员文章站, 2022-05-06 19:47:09

1. Create the project

(1) Create a folder named mztu
(2) In cmd, from the desktop directory, run: scrapy startproject mztu

1.1 Add a User-Agent in settings

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/'
              '537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36')

1.2 Do not obey robots.txt

ROBOTSTXT_OBEY = False

1.3 Set the export encoding in settings

FEED_EXPORT_ENCODING = 'utf-8'

1.4 Build a random request header

from fake_useragent import UserAgent
USER_AGENT = UserAgent().random
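
Note that a value assigned in settings is evaluated once when the settings load, so every request in a run shares the same string. To rotate the header per request, a downloader middleware is one option. The following is a minimal sketch, not part of the original article: the class name RandomUserAgentMiddleware and the USER_AGENTS pool are hypothetical.

```python
import random

# Hypothetical pool of User-Agent strings; extend it with real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request
```

To use it, the class would be placed in mztu/middlewares.py and registered in settings under DOWNLOADER_MIDDLEWARES, for example with priority 400 so that it runs before the built-in UserAgentMiddleware (priority 500), which only fills in the header when it is missing.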

1.5 Wait 1 s between requests

DOWNLOAD_DELAY = 1

1.6 Forge cookies
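
The article leaves this step without code. One common approach is to copy the Cookie header from a browser session and send it with each request. The sketch below only covers turning the raw header string into a dict with the standard library; the cookie names and values are made up, and the helper name parse_cookie_header is hypothetical.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(raw):
    """Turn a raw 'Cookie:' header value into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up cookie string, as it might be copied from the browser's dev tools.
cookies = parse_cookie_header("sessionid=abc123; theme=dark")
print(cookies)

# In a spider, the dict would be passed along with the request, e.g.:
#   yield scrapy.Request(url, cookies=cookies, callback=self.parse)
```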

1.7 IP proxy
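
This step is also left without code. A possible sketch, assuming a hypothetical list of proxy addresses and a hypothetical class name RandomProxyMiddleware: Scrapy's built-in HttpProxyMiddleware picks up the proxy from request.meta['proxy'], so a small downloader middleware only has to set that key.

```python
import random

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXIES = [
    "http://127.0.0.1:8888",
    "http://127.0.0.1:8889",
]


class RandomProxyMiddleware:
    """Downloader middleware that routes each request through a random proxy."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # continue normal processing
```

Registered in DOWNLOADER_MIDDLEWARES with a priority below 750 (the default for HttpProxyMiddleware), it would run before the built-in middleware reads the value.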

2. Define the Item container

(1) Get the target URL, list the content you want to collect, and create the database table

create table catalogue(
id int primary key auto_increment,
movie_id int unsigned default 0,
title varchar(40) not null default '',
director varchar(40) not null default '',
url varchar(150) not null default '',
casts varchar(150) not null default '',
cover varchar(150) not null default '',
rate decimal(2,1) not null default 0.0,
star MEDIUMINT unsigned not null default 0,
cover_x MEDIUMINT unsigned not null default 0
)engine innodb charset utf8;

(2) Create the corresponding fields in the Item:

import scrapy

class MmozItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()

3. Write the spider

(1) Create Mmoz_spider.py in the spiders folder

import scrapy

class MmozSpider(scrapy.Spider):
    name = 'mmoz'  # must be unique; it identifies the spider
    allowed_domains = ['mzitu.com']  # restricts the crawl to this domain
    start_urls = [
        'https://www.mzitu.com/hot',
    ]  # where the crawl starts

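(2) Look for common patterns using scrapy shell in cmd

(2.1) In cmd, change to the mztu project root.
(2.2) Run scrapy shell "https://www.mzitu.com/xinggan" to obtain the response,
then use response.body to inspect the page content (or response.headers for the headers).
(2.2.3) If you run into encoding problems, use response.body.decode('utf8')
(2.2.4) When the response is JSON, Scrapy's response.text returns a str, which has to be converted to a dict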

import json

json.loads(response.text)['data']['directors']
# pull the data out of the JSON response
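
The conversion can be tried outside the shell. The sketch below uses a made-up JSON string in place of response.text; only the 'data' and 'directors' keys come from the line above, the rest is illustrative.

```python
import json

# Made-up response body; a real one would come from response.text.
text = '{"data": {"directors": [{"name": "A"}, {"name": "B"}]}}'

payload = json.loads(text)  # str -> dict
directors = payload["data"]["directors"]
names = [d["name"] for d in directors]
print(names)

# .get() avoids a KeyError when a key is missing from the response.
casts = payload.get("data", {}).get("casts", [])
print(casts)
```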

(2.3) Use XPath to find specific content
.extract(): returns the matched nodes as a list of strings; add /text() to the path to drop the surrounding tags

response.xpath('/html/head/title/text()').extract()
response.xpath('/html/body/div[2]/div[1]/div[2]/nav/div/a[7]').extract()
response.xpath('/html/body/div[2]/div[1]/div[4]/a[5]/span').extract()

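Common expressions:
/html/head/title: selects the <title> element inside the <head> of the HTML document;
/html/head/title/text(): selects the text inside <title>;
//td: selects all <td> elements;
//div[@class="mine"]: selects all div elements that have the attribute class="mine".
You can also use "Copy XPath" in the browser's inspect-element tool.

(2.4) Extract the data
In the shell, use the sel variable that is initialized automatically from the response (in newer Scrapy versions, call response.xpath directly).
link list: sel.xpath('//*[@id="pins"]/li/a/img/@data-original').extract()
title list: sel.xpath('//*[@id="pins"]/li/a/img/@alt').extract()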
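
To experiment with these expressions without a network connection, the sketch below applies their counterparts to a made-up, well-formed snippet. It uses xml.etree.ElementTree from the standard library as a stand-in, not Scrapy's selector: ElementTree supports only a subset of XPath (no text() or @attr steps), so text and attributes are read with .text and .get(). The HTML structure is an assumption modeled on the queries above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet that mimics the structure of the list page.
html = """
<html>
  <head><title>Hot</title></head>
  <body>
    <div class="mine">
      <ul id="pins">
        <li><a href="/1"><img alt="first" data-original="https://example.com/1.jpg" /></a></li>
        <li><a href="/2"><img alt="second" data-original="https://example.com/2.jpg" /></a></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html.strip())

# Counterpart of /html/head/title/text()
title = root.find("head/title").text

# Counterpart of //div[@class="mine"]
mine_divs = root.findall(".//div[@class='mine']")

# Counterpart of //*[@id="pins"]/li/a/img with @data-original and @alt
imgs = root.findall(".//*[@id='pins']/li/a/img")
links = [img.get("data-original") for img in imgs]
titles = [img.get("alt") for img in imgs]

print(title, len(mine_divs), links, titles)
```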

(3) Write the parse method:

def parse(self, response):  # parsing method
    sel = scrapy.selector.Selector(response)
    sites = sel.xpath('//*[@id="pins"]/li/a/img')

    for site in sites:
        link = site.xpath('@data-original').extract()
        title = site.xpath('@alt').extract()

4. Store the content

(1) Import the item module:

from mztu.items import MmozItem

(2) Load the content into items:

def parse(self, response):  # parsing method
    sel = scrapy.selector.Selector(response)
    sites = sel.xpath('//*[@id="pins"]/li/a/img')
    items = []
    for site in sites:
        item = MmozItem()
        item['link'] = site.xpath('@data-original').extract()
        item['title'] = site.xpath('@alt').extract()
        items.append(item)
    return items

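(3) Generate the output file:
In the mztu directory, run scrapy crawl mmoz
Export to a JSON file: scrapy crawl mmoz -o items.json
4.1 Save the data to MySQL
4.1.1 Enable pipelines in settings (the path below comes from a project named douban; use your own project's pipeline path)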

ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}

4.1.2 Write the asynchronous or synchronous storage code in pipelines
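
The article does not show the pipeline code. Below is a minimal synchronous sketch under stated assumptions: it uses sqlite3 from the standard library as a stand-in so that it runs without a MySQL server, and the class name SqlitePipeline and the table images(title, link) are hypothetical. With a MySQL driver, the connection call would differ and the placeholders would typically be %s instead of ?.

```python
import sqlite3


class SqlitePipeline:
    """Synchronous pipeline sketch: one INSERT per item."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS images (title TEXT, link TEXT)"
        )

    def process_item(self, item, spider):
        # .extract() returns lists, so join them into single strings.
        title = "".join(item["title"])
        link = "".join(item["link"])
        self.conn.execute(
            "INSERT INTO images (title, link) VALUES (?, ?)", (title, link)
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Committing after every item keeps the sketch simple; for larger crawls, batching the inserts or moving them off the main thread (the asynchronous variant mentioned above) would reduce the time spent blocking on the database.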

5. Set up the page loop
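
No code is given for this step. One way to loop over pages is to read the "next page" link in parse() and request it with the same callback. The sketch below covers only the URL handling, using the standard library; the helper name next_page_url and the page/N/ URL pattern are assumptions, not taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve a 'next page' href against the page it was found on."""
    if not next_href:
        return None  # no link means the last page was reached
    return urljoin(current_url, next_href)


# Made-up hrefs; on a real page they would come from an XPath query.
print(next_page_url("https://www.mzitu.com/hot/", "page/2/"))
print(next_page_url("https://www.mzitu.com/hot/page/2/", "/hot/page/3/"))
print(next_page_url("https://www.mzitu.com/hot/page/3/", None))

# Inside parse(), the result would be scheduled with the same callback:
#   if url: yield scrapy.Request(url, callback=self.parse)
```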

6. When crawling images, the image links may fall outside allowed_domains, so set dont_filter=True on the request:

yield scrapy.Request(article_url, headers=header, callback=self.image_download, dont_filter=True)
