
Crawling 一药网 (111.com.cn) with the Scrapy framework

程序员文章站 2022-03-06 18:41:29

Create the project from the command line:

scrapy startproject yiyaowang
cd yiyaowang
scrapy genspider yaowang yaowang.com

First, open settings.py and set the robots.txt option to False. Some sites cannot be crawled while obeying robots.txt, so we disable it here:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class YiyaowangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # product (drug) name
    title = scrapy.Field()
    # product price
    price = scrapy.Field()
    # number of reviews
    comment = scrapy.Field()

yaowang.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import YiyaowangItem

class YaowangSpider(scrapy.Spider):
    name = 'yaowang'
    # allowed_domains = ['yaowang.com']

    # Pagination: the page number follows a simple pattern in the URL
    base_url = 'https://www.111.com.cn/categories/953710-j{}.html'
    start_urls = []
    for i in range(1,51):
        start_urls.append(base_url.format(i))

    def parse(self, response):
        # grab every product <li> on the page
        li_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for li in li_list:
            # instantiate a fresh item for each product
            item = YiyaowangItem()
            # Product name.
            # Problem observed: the extracted name came back blank.
            # Analysis:
            #   1. the XPath itself is correct;
            #   2. the first element returned for this text() is a
            #      whitespace-only text node, so extract_first() grabs the blank one.
            # Fix: take the full list with extract()/getall() and index the
            # element we actually want.
            title = li.xpath('.//p[@class="titleBox"]/a/text()').extract()[1].strip()

            # Product price.
            # Problem observed: some prices came back as None even though the
            # page visibly shows a price, so the XPath must be off. The spider
            # sees the raw page source, where the price <span> sits one tag
            # deeper than expected, hence the descendant axis (//) before span.
            price = li.xpath('.//p[@class="price"]//span/text()').extract_first(default='').strip()
            # number of reviews
            comment = li.xpath('.//a[@id="pdlink3"]/em/text()').get()

            item['title'] = title
            item['price'] = price
            item['comment'] = comment

            yield item
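The whitespace issue described in the comments above can be reproduced without Scrapy. The snippet below is a minimal sketch using only the standard library; the HTML fragment is an assumed simplification of the real product markup (the actual page may differ):

```python
import xml.etree.ElementTree as ET

# Assumed, simplified version of one product entry: the <a> tag contains an
# <img> before the name, which splits its content into two text nodes.
fragment = (
    '<li><p class="titleBox">'
    '<a href="#"> <img src="thumb.png"/>Aspirin Tablets </a>'
    '</p></li>'
)

a = ET.fromstring(fragment).find(".//p[@class='titleBox']/a")

# itertext() yields every text node under <a>, like text() does in XPath:
texts = list(a.itertext())
print(texts)             # [' ', 'Aspirin Tablets ']

# The first element is whitespace, which is why extract_first() looked blank;
# indexing past it and stripping recovers the name, as done in the spider.
print(texts[1].strip())  # Aspirin Tablets
```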

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Save items to MongoDB
import pymongo

class YiyaowangPipeline(object):
    def open_spider(self, spider):
        # 1. connect to MongoDB
        self.client = pymongo.MongoClient(host='10.10.34.163', port=27017)
        # 2. select the database
        self.db = self.client['yiyaowang']
        # 3. select the collection
        self.col = self.db['yaowang']

    def process_item(self, item, spider):
        # insert one document per item (Collection.insert() was removed in
        # PyMongo 4; use insert_one() instead)
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()

Finally, remember to uncomment ITEM_PIPELINES in settings.py; if you don't, the pipeline in pipelines.py will never run.

ITEM_PIPELINES = {
   'yiyaowang.pipelines.YiyaowangPipeline': 300,
}
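As a quick sanity check, the pagination pattern used in yaowang.py can be verified standalone in plain Python, independent of Scrapy (pages 1 through 50 are assumed to exist, as in the original spider):

```python
# Same URL template as in the spider's start_urls loop.
base_url = 'https://www.111.com.cn/categories/953710-j{}.html'
start_urls = [base_url.format(i) for i in range(1, 51)]

print(len(start_urls))   # 50
print(start_urls[0])     # https://www.111.com.cn/categories/953710-j1.html
print(start_urls[-1])    # https://www.111.com.cn/categories/953710-j50.html
```

The list comprehension is equivalent to the `for i in range(1, 51)` loop in the spider's class body, just more compact.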

Original article: https://blog.csdn.net/weixin_50199986/article/details/110839708

Tags: python, crawler