Scraping web resources with Nokogiri
2022-05-05 21:20:26
Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.
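Combining Nokogiri's parsing with open-uri's network access is enough to scrape resources from the web. The code below scrapes the WordPress blog 清杯浅酌 (xuzhuoer.com). My Ruby still reads a lot like Java; the only way to develop a better sense of style is to write and read more of it, so experts may want to skip ahead.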
require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task fetch: :environment do
  # The archive page links to every article. URI.open is used because
  # Kernel#open no longer accepts URLs on Ruby 3.0 and later.
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))

    # get the article's content, title and tag list
    content = doc.css('.post > .content').inner_html
    title = doc.css('h1').text
    # join avoids String#strip!, which returns nil when nothing is stripped
    tags = doc.css('.post_info a').map(&:text).join(' ')

    # create the post and save it
    post = Post.create!(body: content, tag_list: tags, title: title)

    # get the article's comments
    doc.css('#comments > .comment').each do |comment|
      author = comment.css('.author').text
      # the author link is optional, so author_url may be nil
      author_link = comment.at_css('a.author')
      author_url = author_link && author_link['href']
      body = comment.css('.content p').text
      # take the author's md5(email) from the avatar URL to get the gravatar
      md5 = comment.at_css('img')['src'][31...63]

      # create and save the comment
      Comment.create!(author: author, author_url: author_url,
                      body: body, avatar_md5: md5,
                      commentable_type: "Post", commentable_id: post.id)
      sleep(5)
    end
    sleep(rand(5))
  end
end
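The slice `[31...63]` in the task above appears to rely on the avatar URL starting with the 31-character prefix `http://www.gravatar.com/avatar/`, followed by the 32-character hexadecimal MD5 of the commenter's email. The sketch below checks that assumption with a made-up email address and query string. The helper name `gravatar_md5` is hypothetical; it shows a pattern-based alternative that does not depend on the exact prefix length.

```ruby
require 'digest/md5'

# Assumed prefix of the avatar URLs on the scraped blog (31 characters).
GRAVATAR_PREFIX = 'http://www.gravatar.com/avatar/'

# Hypothetical helper: pull the 32-character hex hash out of a Gravatar URL.
# Returns nil when the URL contains no such hash.
def gravatar_md5(src)
  src[%r{/avatar/(\h{32})}, 1]
end

# Gravatar hashes the trimmed, lower-cased email address.
email  = ' Someone@Example.com '                   # made-up address
digest = Digest::MD5.hexdigest(email.strip.downcase)
src    = "#{GRAVATAR_PREFIX}#{digest}?s=32"        # made-up query string

puts GRAVATAR_PREFIX.length        # 31
puts src[31...63] == digest        # the fixed slice used in the task
puts gravatar_md5(src) == digest   # the pattern-based alternative
```

The fixed slice is shorter, but it silently returns the wrong characters if the blog switches to `https://` or another avatar host, whereas the pattern returns nil in that case.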
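If the resources you want to scrape are only visible after logging in, this approach is powerless. Adding Mechanize can change that: Mechanize can simulate a form submission and then automatically sends the resulting cookies with later requests.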
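A minimal sketch of that login flow, under several assumptions: the target is a WordPress site whose login form is named `loginform` with fields `log` and `pwd`, the helper `fetch_behind_login` is a hypothetical name, and the URLs and credentials come from environment variables invented for this example (`WP_LOGIN_URL`, `WP_USER`, `WP_PASSWORD`, `WP_TARGET_URL`). Other sites will need different form and field names.

```ruby
# Hypothetical helper: log in through a WordPress-style login form, then
# fetch a page that is only visible to logged-in users.
# `agent` is expected to behave like a Mechanize instance.
def fetch_behind_login(agent, login_url:, username:, password:, target_url:)
  login_page = agent.get(login_url)

  # Assumption: WordPress names its login form "loginform" and its
  # fields "log" (user name) and "pwd" (password).
  form = login_page.form_with(name: 'loginform')
  raise "login form not found at #{login_url}" unless form

  form['log'] = username
  form['pwd'] = password
  form.submit            # the session cookie is stored in the agent's cookie jar

  agent.get(target_url)  # the same agent sends that cookie automatically
end

# Runs only when the (assumed) environment variables are set.
if ENV['WP_LOGIN_URL']
  require 'mechanize'
  page = fetch_behind_login(Mechanize.new,
                            login_url:  ENV['WP_LOGIN_URL'],
                            username:   ENV['WP_USER'],
                            password:   ENV['WP_PASSWORD'],
                            target_url: ENV['WP_TARGET_URL'])
  puts page.title
end
```

The page returned by Mechanize wraps a Nokogiri document, so the same CSS selectors used in the rake task can be applied to it once the login has succeeded.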