Custom configuration of the Java crawler framework gecco
I recently needed to build a crawler in Java. After looking around, I settled on the gecco crawler framework. Basic usage is quick and convenient: follow the small official examples and you can extract data with nothing more than annotations. But my requirements changed later on. The target sites and extraction rules had to be configurable at runtime rather than hard-coded, so I looked into how to configure gecco manually. The official vehicle for manual configuration is DynamicGecco, but it took me two days to work out how to use it. Below is my code; read it alongside the official examples.
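For context, the annotation-driven usage mentioned above looks roughly like this (a minimal sketch along the lines of the official examples; the matchUrl and cssPath values are placeholders, not a real site):

package com.nieyb.gecco;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// Matches article detail pages and sends the extracted bean to the built-in console pipeline.
@Gecco(matchUrl = "https://news.example.com/article/{code}.html", pipelines = "consolePipeline")
public class NewsDetail implements HtmlBean {

    @Text
    @HtmlField(cssPath = "h1.title")
    private String title;

    @HtmlField(cssPath = "div.article-content")
    private String content;

    // getters and setters omitted
}

This works fine as long as the URLs and selectors are known at compile time; the rest of this post is about the case where they are not.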
package com.nieyb.gecco;

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.dynamic.DynamicGecco;
import com.geccocrawler.gecco.request.HttpGetRequest;

public class NewsListStart {

    public static void start(String newslisturl, String newslisturlrule, String newslisttitle, String titleurl,
                             String newslistrule, String newsrule, String titlerule, String contentrule,
                             String formrule, String encoding) {
        // Corresponds to the Category and HrefBean classes of the annotation-based version:
        // one category name plus the list of article links under it.
        Class<?> newstitle = DynamicGecco.html()
                .stringField("parentName").csspath(newslisttitle).text().build()
                .listField("categorys",
                        DynamicGecco.html()
                                .stringField("url").csspath("a").href(true).build()
                                .stringField("title").csspath("a").text().build()
                                .register()).csspath(titleurl).build()
                .register();
        // Corresponds to the ProductList class: the list page, routed to NewsListPipelines below.
        DynamicGecco.html()
                .gecco(newslisturlrule, "newsListPipelines")
                .requestField("request").request().build()
                .listField("newslist", newstitle).csspath(newslistrule).build()
                .register();
        // Corresponds to the ProductDetail class: the article detail page.
        DynamicGecco.html()
                .gecco(newsrule, "newsContentPipelines")
                //.stringField("code").requestParameter().build()
                .requestField("request").request().build()
                .stringField("news").csspath(contentrule).build()
                .stringField("title").csspath(titlerule).text().build()
                .stringField("form").csspath(formrule).text().build()
                .register();

        HttpGetRequest start = new HttpGetRequest(newslisturl);
        start.setCharset(encoding);
        GeccoEngine.create()
                .classpath("com.nieyb.gecco")
                .start(start)
                .interval(200)
                .run();
    }
}
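For reference, a call might look like the following. All of the URL patterns and CSS selectors are placeholder values standing in for whatever your configuration supplies, not a real site:

NewsListStart.start(
        "https://news.example.com/list_1.html",          // newslisturl: entry page to start from
        "https://news.example.com/list_{page}.html",     // newslisturlrule: matchUrl pattern for list pages
        "h3.category-name",                              // newslisttitle: selector for the category name
        "ul.news li",                                    // titleurl: selector for each title/link item
        "div.category-block",                            // newslistrule: selector for each list block
        "https://news.example.com/article/{code}.html",  // newsrule: matchUrl pattern for detail pages
        "h1.title",                                      // titlerule: selector for the article title
        "div.article-content",                           // contentrule: selector for the article body
        "span.source",                                   // formrule: selector for the article source
        "UTF-8");                                        // encoding used by the target site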
The two pipeline (business-processing) classes are as follows. The first receives the list-page results:
package com.nieyb.gecco;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;

@PipelineName("newsListPipelines")
public class NewsListPipelines extends JsonPipeline {

    @Override
    public void process(JSONObject jsonObject) {
        // The request that produced this list page and the extracted list items.
        // Nothing more is done here; the article links are followed by the engine
        // because the "url" field was registered with href(true).
        HttpRequest currRequest = HttpGetRequest.fromJson(jsonObject.getJSONObject("request"));
        JSONArray newslist = jsonObject.getJSONArray("newslist");
    }
}
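If you would rather not rely on href(true) to follow the links automatically, the list pipeline can derive the detail requests itself. The following process() is a sketch modeled on the official DynamicGecco demo; it assumes the JSON layout produced by the field definitions above and needs one extra import, com.geccocrawler.gecco.scheduler.DeriveSchedulerContext:

    @Override
    public void process(JSONObject jsonObject) {
        HttpRequest currRequest = HttpGetRequest.fromJson(jsonObject.getJSONObject("request"));
        JSONArray newslist = jsonObject.getJSONArray("newslist");
        for (int i = 0; i < newslist.size(); i++) {
            JSONArray categorys = newslist.getJSONObject(i).getJSONArray("categorys");
            if (categorys == null) {
                continue;
            }
            for (int j = 0; j < categorys.size(); j++) {
                String url = categorys.getJSONObject(j).getString("url");
                // queue the article page as a sub-request of the current list request
                DeriveSchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }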
And the second, which handles the article detail page:
package com.nieyb.gecco;

import com.alibaba.fastjson.JSONObject;
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.JsonPipeline;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;
import com.nieyb.model.Article;
import com.nieyb.model.ArticleDTO;
import com.nieyb.model.Dictionary;
import com.nieyb.service.ArticleService;
import com.nieyb.service.DictionaryService;
import com.nieyb.utils.SpringContextHolder;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Slf4j
@Component
@PipelineName("newsContentPipelines")
public class NewsContentPipeline extends JsonPipeline {

    @Override
    public void process(JSONObject newsContent) {
        // The request that produced this detail page plus the extracted fields;
        // this is where the article gets persisted (see the note below).
        HttpRequest currRequest = HttpGetRequest.fromJson(newsContent.getJSONObject("request"));
        String title = newsContent.getString("title");
        String form = newsContent.getString("form");
        String news = newsContent.getString("news");
    }
}
The three classes above are the complete code. One caveat: because a pipeline is effectively a utility class, injecting the service to write the crawled data to the database did not work for me. See my next post, "工具类中引用注解", for details; the service isn't injected, it's picked up directly from the scanned Spring context.
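Until then, the workaround is roughly the following, placed at the end of NewsContentPipeline.process() after title/form/news are extracted. The getBean call and the Article/ArticleService methods are placeholders for the project's own utility, entity and service classes:

        // look the service up from the Spring context instead of relying on @Autowired
        ArticleService articleService = SpringContextHolder.getBean(ArticleService.class);
        Article article = new Article();
        article.setTitle(title);
        article.setSource(form);
        article.setContent(news);
        article.setUrl(currRequest.getUrl());
        articleService.save(article);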