Custom configuration of the Java crawler framework gecco
I recently needed to build a crawler in Java. After looking around, I settled on the gecco crawler framework. Basic usage is quick and convenient: follow the small official examples and you can extract data with nothing more than annotations. But my requirements changed later on. The target sites and extraction rules had to be configurable at runtime rather than hard-coded, so I looked into how to configure gecco manually. The official vehicle for manual configuration is DynamicGecco, but it took me two days to work out how to use it. Below is my code; read it alongside the official examples.
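For context, the annotation-driven usage mentioned above looks roughly like this (a minimal sketch along the lines of the official examples; the matchUrl and cssPath values are placeholders, not a real site):

package com.nieyb.gecco;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// Matches article detail pages and sends the extracted bean to the built-in console pipeline.
@Gecco(matchUrl = "https://news.example.com/article/{code}.html", pipelines = "consolePipeline")
public class NewsDetail implements HtmlBean {

    @Text
    @HtmlField(cssPath = "h1.title")
    private String title;

    @HtmlField(cssPath = "div.article-content")
    private String content;

    // getters and setters omitted
}

This works fine as long as the URLs and selectors are known at compile time; the rest of this post is about the case where they are not.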
package com.nieyb.gecco;

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.dynamic.DynamicGecco;
import com.geccocrawler.gecco.request.HttpGetRequest;

public class NewsListStart {

    public static void start(String newslisturl, String newslisturlrule, String newslisttitle, String titleurl,
                             String newslistrule, String newsrule, String titlerule, String contentrule,
                             String formrule, String encoding) {
        // Corresponds to the Category and HrefBean classes of the annotation-based version:
        // one category name plus the list of article links under it.
        Class<?> newstitle = DynamicGecco.html()
                .stringField("parentName").csspath(newslisttitle).text().build()
                .listField("categorys",
                        DynamicGecco.html()
                                .stringField("url").csspath("a").href(true).build()
                                .stringField("title").csspath("a").text().build()
                                .register()).csspath(titleurl).build()
                .register();
        // Corresponds to the ProductList class: the list page, routed to NewsListPipelines below.
        DynamicGecco.html()
                .gecco(newslisturlrule, "newsListPipelines")
                .requestField("request").request().build()
                .listField("newslist", newstitle).csspath(newslistrule).build()
                .register();
        // Corresponds to the ProductDetail class: the article detail page.
        DynamicGecco.html()
                .gecco(newsrule, "newsContentPipelines")
                //.stringField("code").requestParameter().build()
                .requestField("request").request().build()
                .stringField("news").csspath(contentrule).build()
                .stringField("title").csspath(titlerule).text().build()
                .stringField("form").csspath(formrule).text().build()
                .register();

        HttpGetRequest start = new HttpGetRequest(newslisturl);
        start.setCharset(encoding);
        GeccoEngine.create()
                .classpath("com.nieyb.gecco")
                .start(start)
                .interval(200)
                .run();
    }
}
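For reference, a call might look like the following. All of the URL patterns and CSS selectors are placeholder values standing in for whatever your configuration supplies, not a real site:

NewsListStart.start(
        "https://news.example.com/list_1.html",          // newslisturl: entry page to start from
        "https://news.example.com/list_{page}.html",     // newslisturlrule: matchUrl pattern for list pages
        "h3.category-name",                              // newslisttitle: selector for the category name
        "ul.news li",                                    // titleurl: selector for each title/link item
        "div.category-block",                            // newslistrule: selector for each list block
        "https://news.example.com/article/{code}.html",  // newsrule: matchUrl pattern for detail pages
        "h1.title",                                      // titlerule: selector for the article title
        "div.article-content",                           // contentrule: selector for the article body
        "span.source",                                   // formrule: selector for the article source
        "UTF-8");                                        // encoding used by the target site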
The two pipeline (business-processing) classes are as follows. The first receives the list-page results:
package com.nieyb.gecco;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;

@PipelineName("newsListPipelines")
public class NewsListPipelines extends JsonPipeline {

    @Override
    public void process(JSONObject jsonObject) {
        // The request that produced this list page and the extracted list items.
        // Nothing more is done here; the article links are followed by the engine
        // because the "url" field was registered with href(true).
        HttpRequest currRequest = HttpGetRequest.fromJson(jsonObject.getJSONObject("request"));
        JSONArray newslist = jsonObject.getJSONArray("newslist");
    }
}
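If you would rather not rely on href(true) to follow the links automatically, the list pipeline can derive the detail requests itself. The following process() is a sketch modeled on the official DynamicGecco demo; it assumes the JSON layout produced by the field definitions above and needs one extra import, com.geccocrawler.gecco.scheduler.DeriveSchedulerContext:

    @Override
    public void process(JSONObject jsonObject) {
        HttpRequest currRequest = HttpGetRequest.fromJson(jsonObject.getJSONObject("request"));
        JSONArray newslist = jsonObject.getJSONArray("newslist");
        for (int i = 0; i < newslist.size(); i++) {
            JSONArray categorys = newslist.getJSONObject(i).getJSONArray("categorys");
            if (categorys == null) {
                continue;
            }
            for (int j = 0; j < categorys.size(); j++) {
                String url = categorys.getJSONObject(j).getString("url");
                // queue the article page as a sub-request of the current list request
                DeriveSchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }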
And the second, which handles the article detail page:
package com.nieyb.gecco;

import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;
import com.nieyb.model.Article;
import com.nieyb.model.ArticleDTO;
import com.nieyb.model.Dictionary;
import com.nieyb.service.ArticleService;
import com.nieyb.service.DictionaryService;
import com.nieyb.utils.SpringContextHolder;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Slf4j
@Component
@PipelineName("newsContentPipelines")
public class NewsContentPipeline extends JsonPipeline {

    @Override
    public void process(JSONObject newsContent) {
        // The request that produced this detail page plus the extracted fields;
        // this is where the article gets persisted (see the note below).
        HttpRequest currRequest = HttpGetRequest.fromJson(newsContent.getJSONObject("request"));
        String title = newsContent.getString("title");
        String form = newsContent.getString("form");
        String news = newsContent.getString("news");
    }
}
The three classes above are the complete code. One caveat: because a pipeline is effectively a utility class, injecting the service to write the crawled data to the database did not work for me. See my next post, "工具类中引用注解", for details; the service isn't injected, it's picked up directly from the scanned Spring context.
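Until then, the workaround is roughly the following, placed at the end of NewsContentPipeline.process() after title/form/news are extracted. The getBean call and the Article/ArticleService methods are placeholders for the project's own utility, entity and service classes:

        // look the service up from the Spring context instead of relying on @Autowired
        ArticleService articleService = SpringContextHolder.getBean(ArticleService.class);
        Article article = new Article();
        article.setTitle(title);
        article.setSource(form);
        article.setContent(news);
        article.setUrl(currRequest.getUrl());
        articleService.save(article);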