Crawling All JD Product Information with the Java Crawler gecco (Part 1)

程序员文章站 2022-05-23 22:24:14


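The gecco crawler

If you are not yet familiar with gecco, take a look at the project's GitHub home page. gecco is simple to use: crawling all JD product information takes only nine classes.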

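Analyzing the JD site

To crawl all product information on JD, we first need to analyze the site. The JD (京东) site can be roughly divided into three levels: the home page links through categories to product list pages, and each product on a list page has its own detail page. So once we have found every category, we can crawl the products one category at a time.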

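Entry URL

http://www.jd.com/allSort.aspx is the category listing for all JD products. We use this page as the start page for crawling all JD product information.

Creating the HtmlBean class AllSort for the start page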

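import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

// Bean for the start page; gecco fills it in and then runs the listed pipelines.
@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    // The request that produced this page; used later to create sub-requests.
    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;

    // Home appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}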

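This example crawls just two top-level categories, mobile phones and home appliances. Each top-level category contains several groups of subcategories, represented as List<Category>. gecco supports nested beans, which map naturally onto the structure of an HTML page. Category holds the subcategory information, and HrefBean is a shared, reusable link bean.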

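import java.util.List;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

// One dl block on the category page: a parent name plus its subcategory links.
public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    // Text of the link inside the dt element
    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    // Every link inside the dd element
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }

}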
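
To make the nesting concrete, here is a minimal plain-Java sketch of the same shape without any gecco annotations. The classes NestedBeanSketch, Link and Group, the method collectUrls, and the sample names and URLs are all made up for illustration and are not part of gecco. The sketch only shows how a list of groups, each holding a list of links, flattens into the list of URLs to visit next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the article's beans; none of these are gecco classes.
class NestedBeanSketch {

    // A titled link (plays the role of HrefBean).
    static class Link {
        final String title;
        final String url;

        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // One dl block: a parent name plus its child links (plays the role of Category).
    static class Group {
        final String parentName;
        final List<Link> links;

        Group(String parentName, List<Link> links) {
            this.parentName = parentName;
            this.links = links;
        }
    }

    // Flatten every group's links into one list of URLs, keeping page order.
    static List<String> collectUrls(List<Group> groups) {
        List<String> urls = new ArrayList<>();
        for (Group group : groups) {
            for (Link link : group.links) {
                urls.add(link.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample data invented for the sketch.
        List<Group> mobile = Arrays.asList(
            new Group("Phones", Arrays.asList(
                new Link("Smartphones", "http://example.com/list?cat=1"),
                new Link("Feature phones", "http://example.com/list?cat=2"))),
            new Group("Accessories", Arrays.asList(
                new Link("Chargers", "http://example.com/list?cat=3"))));
        System.out.println(collectUrls(mobile));
    }
}
```

The two nested loops here have the same structure as the loops a pipeline would run over getMobile() and getCategorys().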

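A tip for obtaining an element's cssPath

The hard part of the two classes above is working out the cssPath values, so here is a tip. Open the page you want to crawl in Chrome and press F12 to open the developer tools. Then select the element you want to extract.

In the developer tools pane on the right side of the browser, select the element, right-click, and choose Copy -> Copy selector to get the element's cssPath: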

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you are familiar with jQuery selectors, you can shorten this path. Since we also only want the dl elements, it simplifies to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
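
The shortening step can also be expressed mechanically. The following is a rough sketch in plain Java, not a gecco or Chrome feature: the class SelectorSketch and the method simplify are hypothetical names. It assumes the copied selector uses only the child combinator ">" and that one segment carries a class name that is distinctive enough to anchor on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of gecco.
class SelectorSketch {

    // True if the selector segment (e.g. "div.a.b") carries the given class name.
    static boolean hasClass(String segment, String className) {
        String[] parts = segment.split("\\.");
        // parts[0] is the tag name (or empty); class names start at index 1.
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(className)) {
                return true;
            }
        }
        return false;
    }

    // Drop everything before the first segment with anchorClass, reduce that
    // segment to ".anchorClass", keep the remaining segments, append childTag.
    static String simplify(String copied, String anchorClass, String childTag) {
        String[] segments = copied.trim().split("\\s*>\\s*");
        List<String> kept = new ArrayList<>();
        boolean found = false;
        for (String segment : segments) {
            if (!found && hasClass(segment, anchorClass)) {
                found = true;
                kept.add("." + anchorClass);
            } else if (found) {
                kept.add(segment);
            }
        }
        if (!found) {
            throw new IllegalArgumentException("no segment with class " + anchorClass);
        }
        kept.add(childTag);
        return String.join(" > ", kept);
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.main-classify > div.list"
            + " > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2)"
            + " > div.mc > div.items";
        System.out.println(simplify(copied, "category-items", "dl"));
    }
}
```

Anchoring on a class name rather than on the position under body makes the path less sensitive to changes near the top of the page, though the nth-child steps that remain will still break if the category layout changes.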

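Writing the business-logic class for AllSort

Once AllSort has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product lists can be crawled next. Here is the code: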

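import java.util.List;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        // Only the mobile phone category is processed in this example.
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list page to JD self-operated, in-stock products.
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                // Queue the list page as a sub-request of the current request.
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }

}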

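@PipelineName defines the name of the pipeline, which is referenced in the @Gecco annotation on AllSort. After gecco has crawled the page and populated the bean, it calls each pipeline listed in @Gecco in turn. Appending "&delivery=1&page=1&JL=4_10_0&go=0" to each sub-link restricts the crawl to products that are sold by JD itself and are in stock. SchedulerContext.into() puts the link into the queue, where it waits to be crawled.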
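
Note that the pipeline joins the filter with "&", which assumes each category link already carries a query string. As a hedged illustration only, the plain-Java sketch below picks the separator based on the URL. The class FilterUrlSketch, the method withFilter, and the example.com URLs are assumptions made for this sketch, and the ArrayDeque merely stands in for the crawl queue; it is not how gecco implements its scheduler.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical helper for illustration; not part of gecco.
class FilterUrlSketch {

    // The filter parameters used in the article, without a leading separator.
    static final String FILTER = "delivery=1&page=1&JL=4_10_0&go=0";

    // Append the filter, choosing the separator from the shape of the URL.
    static String withFilter(String url) {
        String separator;
        if (url.endsWith("?") || url.endsWith("&")) {
            separator = "";      // URL is already waiting for a parameter
        } else if (url.contains("?")) {
            separator = "&";     // a query string exists, so extend it
        } else {
            separator = "?";     // no query string yet, so start one
        }
        return url + separator + FILTER;
    }

    public static void main(String[] args) {
        // A simple FIFO queue standing in for the crawler's pending-request queue.
        Queue<String> pending = new ArrayDeque<>();
        pending.add(withFilter("http://example.com/list.html?cat=1"));
        pending.add(withFilter("http://example.com/list.html"));
        while (!pending.isEmpty()) {
            System.out.println(pending.poll());
        }
    }
}
```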
