
Scraping web resources with Nokogiri

程序员文章站 2022-05-05 21:20:26

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.
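Combining Nokogiri's parsing with open-uri's network access is enough to scrape resources from the web. The code below scrapes 清杯浅酌, a WordPress blog. My Ruby still reads a lot like Java; the only cure is to write and read more code until my sense of style improves, so experts, please look the other way.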

require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task :fetch => :environment do
  # the archive page links to every article
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))
    # get the article's content & title & tag_list
    content = doc.css('.post > .content').inner_html
    title = doc.css('h1').text
    # join with map/join: String#strip! returns nil when there is nothing to strip
    tags = doc.css('.post_info a').map(&:text).join(" ")
    # create post and save it
    post = Post.create!(:body => content, :tag_list => tags, :title => title)
    # get the article's comments
    doc.css('#comments > .comment').each do |comment|
      author = comment.css('.author').text
      # author_url stays nil when the author name is not a link
      author_link = comment.at_css('a.author')
      author_url = author_link && author_link['href']
      body = comment.css('.content p').text
      # fetch the author's md5(email) to get gravatar: skip the first 31
      # characters of the avatar's src and take the 32 characters after them
      avatar = comment.at_css('img')
      md5 = avatar && avatar['src'].to_s[31...63]
      # create & save comment
      Comment.create!(:author => author, :author_url => author_url,
                      :body => body, :avatar_md5 => md5,
                      :commentable_type => "Post", :commentable_id => post.id)
      sleep(5)
    end
    # wait 0 to 4 seconds before requesting the next article
    sleep(rand(5))
  end
end
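
The fixed slice [31...63] in the task only works while the avatar URL starts with a prefix of exactly 31 characters. As a rough alternative, the hash can be matched by pattern. This sketch assumes the blog serves Gravatar-style URLs of the form .../avatar/<32 hex characters>; gravatar_hash is a hypothetical helper name, not part of any library.

```ruby
# Pull the 32-character hex hash out of a Gravatar-style avatar URL.
# Returns nil when the URL has no such segment (or when src is nil).
def gravatar_hash(src)
  match = src.to_s.match(%r{/avatar/([0-9a-f]{32})}i)
  match && match[1].downcase
end

# Made-up URLs for illustration; the hash is not a real account.
plain  = "http://www.gravatar.com/avatar/0123456789abcdef0123456789abcdef?s=40"
secure = "https://secure.gravatar.com/avatar/0123456789ABCDEF0123456789abcdef?s=40"

puts gravatar_hash(plain)
puts gravatar_hash(secure)
```

Both calls yield the same hash, while the fixed offset would cut the second URL in the wrong place because its prefix is longer.
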
 

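If the resources you want to scrape are only visible after logging in, this approach is powerless. Adding Mechanize may change that, though. Mechanize can simulate a form submission and automatically sends the resulting cookies with later requests.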