
Crawling 一药网 (111.com.cn) with the Scrapy framework

程序员文章站 2022-03-06 18:41:29

Create the project from the command line:

scrapy startproject yiyaowang
cd yiyaowang
scrapy genspider yaowang yaowang.com

First, open settings.py and set the robots.txt option to False. Some sites cannot be crawled while obeying robots.txt, so we disable it here:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class YiyaowangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # product (drug) name
    title = scrapy.Field()
    # product price
    price = scrapy.Field()
    # number of reviews
    comment = scrapy.Field()

yaowang.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import YiyaowangItem

class YaowangSpider(scrapy.Spider):
    name = 'yaowang'
    # allowed_domains = ['yaowang.com']

    # Pagination: the page number follows a simple pattern in the URL
    base_url = 'https://www.111.com.cn/categories/953710-j{}.html'
    start_urls = []
    for i in range(1,51):
        start_urls.append(base_url.format(i))

    def parse(self, response):
        # grab every product <li> on the page
        li_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for li in li_list:
            # instantiate a fresh item for each product
            item = YiyaowangItem()
            # Product name.
            # Problem observed: the extracted name came back blank.
            # Analysis:
            #   1. the XPath itself is correct;
            #   2. the first element returned for this text() is a
            #      whitespace-only text node, so extract_first() grabs the blank one.
            # Fix: take the full list with extract()/getall() and index the
            # element we actually want.
            title = li.xpath('.//p[@class="titleBox"]/a/text()').extract()[1].strip()

            # Product price.
            # Problem observed: some prices came back as None even though the
            # page visibly shows a price, so the XPath must be off. The spider
            # sees the raw page source, where the price <span> sits one tag
            # deeper than expected, hence the descendant axis (//) before span.
            price = li.xpath('.//p[@class="price"]//span/text()').extract_first(default='').strip()
            # number of reviews
            comment = li.xpath('.//a[@id="pdlink3"]/em/text()').get()

            item['title'] = title
            item['price'] = price
            item['comment'] = comment

            yield item
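The whitespace issue described in the comments above can be reproduced without Scrapy. The snippet below is a minimal sketch using only the standard library; the HTML fragment is an assumed simplification of the real product markup (the actual page may differ):

```python
import xml.etree.ElementTree as ET

# Assumed, simplified version of one product entry: the <a> tag contains an
# <img> before the name, which splits its content into two text nodes.
fragment = (
    '<li><p class="titleBox">'
    '<a href="#"> <img src="thumb.png"/>Aspirin Tablets </a>'
    '</p></li>'
)

a = ET.fromstring(fragment).find(".//p[@class='titleBox']/a")

# itertext() yields every text node under <a>, like text() does in XPath:
texts = list(a.itertext())
print(texts)             # [' ', 'Aspirin Tablets ']

# The first element is whitespace, which is why extract_first() looked blank;
# indexing past it and stripping recovers the name, as done in the spider.
print(texts[1].strip())  # Aspirin Tablets
```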

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Save items to MongoDB
import pymongo

class YiyaowangPipeline(object):
    def open_spider(self, spider):
        # 1. connect to MongoDB
        self.client = pymongo.MongoClient(host='10.10.34.163', port=27017)
        # 2. select the database
        self.db = self.client['yiyaowang']
        # 3. select the collection
        self.col = self.db['yaowang']

    def process_item(self, item, spider):
        # insert one document per item (Collection.insert() was removed in
        # PyMongo 4; use insert_one() instead)
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()

Finally, remember to uncomment ITEM_PIPELINES in settings.py; if you don't, the pipeline in pipelines.py will never run.

ITEM_PIPELINES = {
   'yiyaowang.pipelines.YiyaowangPipeline': 300,
}
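As a quick sanity check, the pagination pattern used in yaowang.py can be verified standalone in plain Python, independent of Scrapy (pages 1 through 50 are assumed to exist, as in the original spider):

```python
# Same URL template as in the spider's start_urls loop.
base_url = 'https://www.111.com.cn/categories/953710-j{}.html'
start_urls = [base_url.format(i) for i in range(1, 51)]

print(len(start_urls))   # 50
print(start_urls[0])     # https://www.111.com.cn/categories/953710-j1.html
print(start_urls[-1])    # https://www.111.com.cn/categories/953710-j50.html
```

The list comprehension is equivalent to the `for i in range(1, 51)` loop in the spider's class body, just more compact.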

Original article: https://blog.csdn.net/weixin_50199986/article/details/110839708

Tags: python, crawler