
Scrapy spider example (1)


Spider example

  1. Define the items in advance
import scrapy
class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
  2. Crawl scope (allowed_domains) and start_urls
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    # allowed_domains must hold bare domain names (no http:// prefix),
    # otherwise OffsiteMiddleware filters out every request
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
  3. parse() does three things: it extracts the title and date of each post, grabs the URL of the post's detail page, and grabs the URL of the next listing page
    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()  # extract the title with xpath
            items['date'] = tr.xpath("./td[6]//text()").extract_first()       # extract the date the same way
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()     # extract the link to the post body
            # Hand the content link to the next callback; title and date travel along with it,
            # so all the fields can be assembled in one place.
            # About yield: the first argument is the URL to request, callback names the function
            # that will handle the response.
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                # meta carries data from one request to the next; it is a dict-like object
                meta={
                    'date': items['date'],
                    'title': items['title']
                }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        # limit how many pages get crawled (page_num is defined at module level in the full code)
        if "page=" + str(page_num * 30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )
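One detail worth watching in the pagination step: @href values on listing pages are sometimes relative rather than absolute. Below is a small, defensive variant of the next-page request using Scrapy's response.urljoin, which leaves already-absolute URLs untouched; whether the hrefs on this particular page actually need it is not something the original code settles.

        next_href = new_url[-2]
        if "page=" + str(page_num * 30) not in next_href:
            yield scrapy.Request(
                response.urljoin(next_href),  # resolves a relative href against the current page URL
                callback=self.parse
            )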
  4. The second function: gathers all the fields into a single item and hands it to the pipeline
    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items
  5. The pipeline does nothing special here: since the data isn't processed any further, it is simply printed
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        items = item
        print('*' * 100)
        print(items['date'])
        print(items['title'])
        print(items['content'])
        return item  # process_item should return the item so any later pipeline can keep working on it
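One thing the walkthrough skips: Scrapy only calls a pipeline that has been enabled in settings.py. A minimal sketch, assuming the project is named superspider (as the import from superspider.items suggests) and the class lives in superspider/pipelines.py:

# settings.py (excerpt)
ITEM_PIPELINES = {
    'superspider.pipelines.SuperspiderPipeline': 300,  # lower numbers run earlier
}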

Complete code

  • The items part (items.py)

import scrapy

class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
  • Spider code
# -*- coding: utf-8 -*-
import scrapy
from superspider.items import SuperspiderItem
page_num = 3
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']

    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()
            items['date'] = tr.xpath("./td[6]//text()").extract_first()
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                meta={
                    'date': items['date'],
                    'title': items['title']
                }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        if "page="+str(page_num*30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )

    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items
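A side note on the selector API: in current Scrapy releases, .get() and .getall() are the preferred spellings of extract_first() and extract(); the code above works either way. For example:

items['title'] = tr.xpath("./td[3]/a[1]/@title").get()  # equivalent to extract_first()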
  • Pipeline code (pipelines.py)
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        items = item
        print('*' * 100)
        print(items['date'])
        print(items['title'])
        print(items['content'])
        return item  # return the item so any later pipeline can keep processing it
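With the spider named 'spider1', it can be run from the project root with the standard Scrapy command; the pipeline's print output then appears in the console:

scrapy crawl spider1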

Problems encountered along the way

  • allowed_domains was written wrong while the log level was set to WARNING, which hid the error and made it hard to track down (see the settings sketch after this list)
  • Wasn't clear on how yield works
  • A SuperspiderItem() has to be imported and instantiated first (note the parentheses)
  • The pipeline does not need to import SuperspiderItem
  • Forgot to write extract()
  • The xpath //div[contains(@align,'center')] needs care: the closing ] after contains(...) is easy to drop
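On the first point, a minimal settings.py sketch (the values here are suggestions, not taken from the original project): leaving LOG_LEVEL at DEBUG while developing keeps the "Filtered offsite request" messages from OffsiteMiddleware visible, which is exactly the clue that allowed_domains is wrong.

# settings.py (excerpt)
# DEBUG (the default) shows "Filtered offsite request to ..." messages,
# which reveal a bad allowed_domains; raise to WARNING only once the spider works.
LOG_LEVEL = 'DEBUG'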