
Custom usage of the Java crawler framework gecco

程序员文章站 2022-05-05 12:52:04

I recently needed to build a crawler in Java. After searching around, I settled on the gecco crawler framework. Its basic, annotation-driven usage is quick and convenient, and the small official examples cover it well: you simply annotate a bean and gecco extracts the fields for you. My requirements then changed, however: the target sites and extraction rules had to be configurable at runtime, not hardcoded in annotations. So I looked into configuring gecco by hand. The official example for this is DynamicGecco, but it still took me two days to work out. Below I post my code; comparing it against the official example should make the API easier to pick up.
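For contrast, the annotation-driven style from the official examples looks roughly like this. The URL pattern, selectors, and field names here are invented for illustration; only the annotations (@Gecco, @HtmlField, @Text, @Href, @Request) are gecco's own:

```
@Gecco(matchUrl = "https://news.example.com/list/{page}.html", pipelines = "consolePipeline")
public class NewsListBean implements HtmlBean {

    @Request
    private HttpRequest request;          // the request that produced this page

    @Text
    @HtmlField(cssPath = ".category-name")
    private String parentName;            // extracted as text via the csspath

    @Href
    @HtmlField(cssPath = ".category li a")
    private String url;                   // extracted from the href attribute

    // getters and setters omitted for brevity
}
```

DynamicGecco builds exactly this kind of bean at runtime, which is what the code below does.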

package com.nieyb.gecco;

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.dynamic.DynamicGecco;
import com.geccocrawler.gecco.request.HttpGetRequest;

public class NewsListStart {


    public static void start(String newslisturl, String newslisturlrule, String newslisttitle, String titleurl,
                             String newslistrule, String newsrule, String titlerule, String contentrule, String formrule, String Encoding) {
        //corresponds to the Category and HrefBean classes in the official example

        Class<?> newstitle = DynamicGecco.html()
                .stringField("parentName").csspath(newslisttitle).text().build()
                .listField("categorys",
                        DynamicGecco.html()
                                .stringField("url").csspath("a").href(true).build()
                                .stringField("title").csspath("a").text().build()
                                .register()).csspath(titleurl).build()
                .register();

        //corresponds to the ProductList class
        DynamicGecco.html()
                .gecco(newslisturlrule, "newsListPipelines") //wire up the list pipeline defined below ("consolePipeline" is handy while debugging)
                .requestField("request").request().build()
                .listField("newslist", newstitle).csspath(newslistrule).build()
                .register();

        //corresponds to the ProductDetail class
        DynamicGecco.html()
                .gecco(newsrule,  "newsContentPipelines")
                //.stringField("code").requestParameter().build()
                .requestField("request").request().build()
                .stringField("news").csspath(contentrule).build()
                .stringField("title").csspath(titlerule).text().build()
                .stringField("form").csspath(formrule).text().build()
                //.stringField("culmn").image()
                .register();

        HttpGetRequest start = new HttpGetRequest(newslisturl);
        start.setCharset(Encoding);
        GeccoEngine.create()
                .classpath("com.nieyb.gecco")
                .start(start)
                .interval(200)
                .run();

    }
}
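To make the parameters concrete, here is a hypothetical configuration for an imaginary news site, showing what each argument of start(...) above is expected to contain. Every URL and selector here is invented for illustration:

```java
public class CrawlerConfigDemo {
    public static void main(String[] args) {
        String newsListUrl     = "https://news.example.com/list/1.html";      // entry page to crawl first
        String newsListUrlRule = "https://news.example.com/list/{page}.html"; // matchUrl pattern for list pages
        String newsListTitle   = ".category-name";   // csspath of the category heading
        String titleUrl        = ".category li";     // csspath of each link item in a category
        String newsListRule    = ".category-box";    // csspath of one list block
        String newsRule        = "https://news.example.com/article/{id}.html"; // matchUrl pattern for detail pages
        String titleRule       = "h1.article-title"; // csspath of the article title
        String contentRule     = "div.article-body"; // csspath of the article body
        String formRule        = ".article-source";  // csspath of the source/author line
        String encoding        = "UTF-8";

        // With gecco on the classpath, the crawl would be started via the
        // start(...) method defined above:
        // start(newsListUrl, newsListUrlRule, newsListTitle, titleUrl,
        //       newsListRule, newsRule, titleRule, contentRule, formRule, encoding);
        System.out.println(newsListUrlRule);
    }
}
```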

The two business-processing (pipeline) classes are:

package com.nieyb.gecco;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;

@PipelineName(value="newsListPipelines")
public class NewsListPipelines extends JsonPipeline {

    @Override
    public void process(JSONObject jsonObject) {
        HttpRequest currRequest = HttpGetRequest.fromJson(jsonObject.getJSONObject("request"));
        JSONArray newslist = jsonObject.getJSONArray("newslist");
        //iterate newslist here and schedule a sub-request for each detail-page URL
    }

}
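The list pipeline above receives the extracted data as JSON but does not yet do anything with it. In the official DynamicGecco example, the detail-page URLs are pushed back into the scheduler as sub-requests of the current request. A sketch of that step for this bean layout (the field names "categorys" and "url" follow the stringField/listField definitions above; SchedulerContext is gecco's scheduler entry point):

```
for (int i = 0; i < newslist.size(); i++) {
    // each list block carries a "categorys" array of {url, title} entries
    JSONArray categorys = newslist.getJSONObject(i).getJSONArray("categorys");
    for (int j = 0; j < categorys.size(); j++) {
        String url = categorys.getJSONObject(j).getString("url");
        // hand the detail page back to gecco as a sub-request of the current one
        SchedulerContext.into(currRequest.subRequest(url));
    }
}
```

The detail pages scheduled this way are then matched by the newsrule pattern and flow into the content pipeline below.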

package com.nieyb.gecco;


import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;
import com.nieyb.model.Article;
import com.nieyb.model.ArticleDTO;
import com.nieyb.model.Dictionary;
import com.nieyb.service.ArticleService;
import com.nieyb.service.DictionaryService;
import com.nieyb.utils.SpringContextHolder;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;


@Slf4j
@Component
@PipelineName(value = "newsContentPipelines")
public class NewsContentPipeline extends JsonPipeline {

    @Override
    public void process(JSONObject newsContent) {
        HttpRequest currRequest = HttpGetRequest.fromJson(newsContent.getJSONObject("request"));
        String title = newsContent.getString("title");
        String form = newsContent.getString("form");
        String news = newsContent.getString("news");
        //persist the article here, e.g. via ArticleService (see the note on injection below)
    }
}

The three classes above are the complete code. One caveat: because the pipeline class effectively behaves like a utility class (it is instantiated by gecco, not by Spring), injecting the service for saving the crawled data to the database did not work. You can refer to my next post, "工具类中引用注解" (using annotations in utility classes): instead of injecting the bean, fetch it directly from the scanned application context.
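That workaround usually looks something like the following. SpringContextHolder is the small helper imported in the pipeline above; this is a minimal sketch assuming a standard ApplicationContextAware implementation, not the exact class from my project:

```
import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.stereotype.Component;

@Component
public class SpringContextHolder implements ApplicationContextAware {

    private static ApplicationContext context;

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) throws BeansException {
        context = applicationContext; //cached once when Spring starts up
    }

    //fetch a bean by type from anywhere, e.g. inside a gecco pipeline
    public static <T> T getBean(Class<T> clazz) {
        return context.getBean(clazz);
    }
}
```

Inside the pipeline's process(...) you would then call SpringContextHolder.getBean(ArticleService.class) before saving, rather than relying on field injection.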