Scraping web resources with Nokogiri

程序员文章站 2022-05-05 21:20:26


Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.

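Combining Nokogiri's parsing with open-uri's network access is enough to scrape some resources from the web. The code below scrapes the WordPress blog 清杯浅酌. My Ruby still reads rather like Java, and the only way to develop a better sense of style is to write and read more of it, so experts may want to skip ahead.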

require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task :fetch => :environment do
  # URI.open comes from open-uri; a bare Kernel#open no longer accepts URLs on Ruby 3.0+
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))
    # get the article's content & title & tag_list
    content = doc.css('.post > .content').inner_html
    title = doc.css('h1').text
    # join instead of strip!, which returns nil when there is nothing to strip
    tags = doc.css('.post_info a').map(&:text).join(" ")
    # create post and save it
    post = Post.create!(:body => content, :tag_list => tags, :title => title)
    # get the article's comments (already part of the fetched page, so no extra request)
    doc.css('#comments > .comment').each do |comment|
      author = comment.css('.author').text
      # the author link is optional, so author_url may stay nil
      author_link = comment.at_css('a.author')
      author_url = author_link && author_link['href']
      body = comment.css('.content p').text
      # fetch the author's md5(email) to get gravatar:
      # take the 32 hex characters from the avatar URL
      avatar = comment.at_css('img')
      md5 = avatar && avatar['src'].to_s[/\h{32}/]
      # create & save comment
      Comment.create!(:author => author, :author_url => author_url,
          :body => body, :avatar_md5 => md5,
          :commentable_type => "Post", :commentable_id => post.id)
    end
    # pause between page requests to go easy on the server
    sleep(rand(5))
  end
end
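
The `avatar_md5` value stored with each comment is the hex MD5 digest that Gravatar uses to identify an email address. As a minimal sketch of how that digest relates to the avatar URL, assuming the URL embeds the 32-character digest (the helper names and the sample address are hypothetical, not part of the original task):

```ruby
require 'digest/md5'

# Hypothetical helper: pull the 32-character hex digest out of an avatar URL.
# Returns nil when the URL contains no such digest.
def gravatar_md5(src)
  src.to_s[/\h{32}/]
end

# Hypothetical helper: Gravatar hashes the trimmed, lower-cased email address.
def email_md5(email)
  Digest::MD5.hexdigest(email.strip.downcase)
end

digest = email_md5(" Someone@Example.com ")   # sample address, for illustration only
url    = "http://www.gravatar.com/avatar/#{digest}?s=32"

puts gravatar_md5(url) == digest                              # true
puts gravatar_md5("http://example.com/default.png").inspect  # nil
```

Matching on the digest itself rather than on a fixed character offset keeps working if the avatar host or the URL scheme changes.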

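If the resources you want to scrape are only visible after logging in, this approach is powerless. Adding Mechanize can change that: Mechanize can simulate a form submission and then automatically sends the resulting cookies with later requests.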
