欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

Gecco爬虫代码分析

程序员文章站 2022-05-05 12:29:28
...

[email protected]

###配置需要渲染的SpiderBean

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {

	private static final long serialVersionUID = -7127412585200687225L;
	
	@RequestParameter("user")
	private String user;
	
	@RequestParameter("project")
	private String project;
	
	@HtmlField(cssPath=".repository-meta-content")
	private String title;
	
	@Text
	@HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
	private int star;
	
	@Text
	@HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
	private int fork;
	
	@HtmlField(cssPath=".entry-content")
	private String readme;

	public String getReadme() {
		return readme;
	}

	public void setReadme(String readme) {
		this.readme = readme;
	}

	public String getUser() {
		return user;
	}

	public void setUser(String user) {
		this.user = user;
	}

	public String getProject() {
		return project;
	}

	public void setProject(String project) {
		this.project = project;
	}

	public String getTitle() {
		return title;
	}

	public void setTitle(String title) {
		this.title = title;
	}

	public int getStar() {
		return star;
	}

	public void setStar(int star) {
		this.star = star;
	}

	public int getFork() {
		return fork;
	}

	public void setFork(int fork) {
		this.fork = fork;
	}
}

###启动爬虫引擎

public static void main(String[] args) {
	GeccoEngine.create()
	.classpath("com.geccocrawler.gecco.demo")
	//爬虫userAgent设置
	.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36")
	//开始抓取的页面地址
	.start("https://github.com/xtuhcy/gecco")
	//开启几个爬虫线程
	.thread(1)
	//单个爬虫每次抓取完一个请求后的间隔时间
	.interval(2000)
	.run();
}

##公共注解说明 ###@Gecco

定义一个SpiderBean必须有的注解,告诉爬虫引擎什么样的url转换成该java bean,使用什么渲染器渲染,java bean渲染完成后传递给哪些管道过滤器继续处理

相关标签: 爬虫