Java Crawler Example: Scraping News with Gecco

程序员文章站 2024-03-11 21:44:49


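I recently came across the Gecco crawler framework and found it simple and easy to use, so I wrote a demo to try it out. The demo crawls a news site (http://zj.zjol.com.cn) and extracts the title and publication time of each news item as the test data. HTML nodes are selected with jQuery-style CSS selectors, which is very convenient. Gecco relies mainly on annotations to match URLs, so the code stays clean and concise.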

Add the Maven dependency

<dependency>
   <groupId>com.geccocrawler</groupId>
   <artifactId>gecco</artifactId>
   <version>1.0.8</version>
</dependency>

Write the list page crawler

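import java.util.List;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl = "http://zj.zjol.com.cn/home.html?pageindex={pageindex}&pagesize={pagesize}", pipelines = "zjnewslistpipelines")
public class ZjNewsGeccoList implements HtmlBean {
  @Request
  private HttpRequest request;
  @RequestParameter("pageindex")
  private int pageIndex;
  @RequestParameter("pagesize")
  private int pageSize;
  @HtmlField(cssPath = "#content > div > div > div.con_index > div.r.main_mod > div > ul > li > dl > dt > a")
  private List<HrefBean> newList;
  // Getters and setters for all fields are omitted here; the pipeline calls getRequest(), getPageIndex() and getNewList().
}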
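To make the placeholder syntax in the match URL more concrete, here is a minimal sketch of how a template such as `...?pageindex={pageindex}&pagesize={pagesize}` could be turned into a regular expression and used to pull parameter values out of a URL. It is not Gecco's own implementation. The class name `UrlTemplateSketch`, and the rule that a placeholder matches any run of characters other than `/`, `&` and `?`, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical helper, not part of Gecco. */
class UrlTemplateSketch {
    // Finds {name} placeholders in the template.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if the URL fits the template, otherwise empty. */
    static Optional<Map<String, String>> match(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote literal text so characters such as '?' and '.' match themselves.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a placeholder value never contains '/', '&' or '?'.
            regex.append("([^/&?]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return Optional.empty();
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            values.put(names.get(i), urlMatcher.group(i + 1));
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        String template = "http://zj.zjol.com.cn/home.html?pageindex={pageindex}&pagesize={pagesize}";
        // A list page URL fits the template and yields both parameters.
        System.out.println(match(template, "http://zj.zjol.com.cn/home.html?pageindex=3&pagesize=100"));
        // A URL with a different shape does not fit, so the result is empty.
        System.out.println(match(template, "http://zj.zjol.com.cn/news/123.html"));
    }
}
```

In the crawler itself this work is done by the framework: the annotated fields of the bean receive the values captured from the URL.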

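import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("zjnewslistpipelines")
public class ZjNewsListPipelines implements Pipeline<ZjNewsGeccoList> {
  @Override
  public void process(ZjNewsGeccoList zjNewsGeccoList) {
    HttpRequest request = zjNewsGeccoList.getRequest();
    for (HrefBean bean : zjNewsGeccoList.getNewList()) {
      // Queue each detail page for crawling
      SchedulerContext.into(request.subRequest("http://zj.zjol.com.cn" + bean.getUrl()));
    }
    int page = zjNewsGeccoList.getPageIndex() + 1;
    String nextUrl = "http://zj.zjol.com.cn/home.html?pageindex=" + page + "&pagesize=100";
    // Queue the next list page
    SchedulerContext.into(request.subRequest(nextUrl));
  }
}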
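The list pipeline builds two kinds of URLs: detail page URLs from the extracted links, and the URL of the next list page. As written it always queues the next page, so the crawl has no explicit stopping point. The sketch below shows one way these two steps could be written with the standard library only. The class `ListPageUrlsSketch`, the `maxPage` limit and the sample path `/news/123.html` are hypothetical and do not come from the original demo or from Gecco.

```java
import java.net.URI;
import java.util.Optional;

/** Illustrative helpers; the names and the maxPage limit are assumptions. */
class ListPageUrlsSketch {
    private static final URI BASE = URI.create("http://zj.zjol.com.cn/");

    /** Resolves an href taken from the list page against the site root. */
    static String detailUrl(String href) {
        // URI.resolve handles both relative and absolute hrefs.
        return BASE.resolve(href).toString();
    }

    /** Builds the next list page URL, or returns empty once maxPage is reached. */
    static Optional<String> nextPageUrl(int currentPage, int pageSize, int maxPage) {
        if (currentPage >= maxPage) {
            return Optional.empty();
        }
        return Optional.of("http://zj.zjol.com.cn/home.html?pageindex=" + (currentPage + 1)
                + "&pagesize=" + pageSize);
    }

    public static void main(String[] args) {
        System.out.println(detailUrl("/news/123.html"));
        System.out.println(nextPageUrl(1, 100, 3));
        System.out.println(nextPageUrl(3, 100, 3));
    }
}
```

Inside the pipeline, the next page would then only be queued when `nextPageUrl` returns a value. Another possible stop condition is an empty link list on the current page.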

Write the detail page crawler

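import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl = "http://zj.zjol.com.cn/news/{code}.html", pipelines = "zjnewsdetailpipeline")
public class ZjNewsDetail implements HtmlBean {

  @Text
  @HtmlField(cssPath = "#headline")
  private String title;

  @Text
  @HtmlField(cssPath = "#content > div > div.news_con > div.news-content > div:nth-child(1) > div > p.go-left.post-time.c-gray")
  private String createTime;
  // Getters and setters are omitted here; the pipeline calls getTitle() and getCreateTime().
}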

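import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;

@PipelineName("zjnewsdetailpipeline")
public class ZjNewsDetailPipeline implements Pipeline<ZjNewsDetail> {
  @Override
  public void process(ZjNewsDetail zjNewsDetail) {
    // Print the title and publication time of each news item
    System.out.println(zjNewsDetail.getTitle() + " " + zjNewsDetail.getCreateTime());
  }
}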
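The detail pipeline receives the publication time as plain text. If a real timestamp is needed later, for sorting or storage, the text could be parsed along the following lines. This is only a sketch: the pattern `yyyy-MM-dd HH:mm:ss` is an assumption about how the page renders the time, and the class `PublishTimeSketch` is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Illustrative only: parses the scraped time text under an assumed format. */
class PublishTimeSketch {
    // Assumed format; adjust it to whatever the page actually shows.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Returns the parsed time, or empty if the text is missing or has another format. */
    static Optional<LocalDateTime> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            // Scraped text often carries surrounding whitespace, so trim it first.
            return Optional.of(LocalDateTime.parse(raw.trim(), FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" 2024-03-11 21:44:49 "));
        System.out.println(parse("not a timestamp"));
    }
}
```

Returning an empty `Optional` rather than throwing keeps a single malformed page from interrupting the rest of the crawl.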

Start the crawler from the main method

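import com.geccocrawler.gecco.GeccoEngine;

public class Main {
  public static void main(String[] args) {
    GeccoEngine.create()
        // Package to scan for the annotated classes
        .classpath("com.zhaochao.gecco.zj")
        // URL of the first page to crawl
        .start("http://zj.zjol.com.cn/home.html?pageindex=1&pagesize=100")
        // Number of crawler threads
        .thread(10)
        // Pause for each crawler thread after it finishes a request (milliseconds)
        .interval(10)
        // Use a desktop (PC) User-Agent
        .mobile(false)
        // Start running
        .run();
  }
}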

Crawl results

[Image: screenshot of the crawl output]

That is all for this article. I hope it helps with your learning, and thank you for your support.
