
Study notes on webmagic, a Java crawler framework

程序员文章站 2022-04-07 18:05:32

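1. Preface

My graduation project needs a crawler. Learning to write one in Python looked like it would take too long, so I collected some crawler code from the web instead. I am recording it here so I can reuse it directly once I have forgotten the details.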

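2. About the framework

webmagic is a crawler framework that needs no configuration and is easy to extend. It offers a simple, flexible API, so a crawler takes only a small amount of code. webmagic is fully modular and covers the whole crawler lifecycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, as well as automatic retries and custom UA/cookie settings. PS: I built a crawler with it myself and it really is convenient, and well suited to beginners.
I won't go into more detail here; see the API documentation: http://webmagic.io/
After quite a bit of searching, though, the examples I found online were all very simple. There were only basic demos, and none of them was integrated with a current stack, so I ended up doing it myself: I integrated webmagic into a newly built Spring Boot + MyBatis (annotation-based) project.
3. Setup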
pom.xml

   <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>2.1.3</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.generator</groupId>
            <artifactId>mybatis-generator-core</artifactId>
            <version>1.3.6</version>
        </dependency>
        <!-- crawler (webmagic) -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.73</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- mybatis-generator plugin for generating code automatically -->
            <plugin>
                <groupId>org.mybatis.generator</groupId>
                <artifactId>mybatis-generator-maven-plugin</artifactId>
                <version>1.3.6</version>
            </plugin>
        </plugins>
    </build>

Project structure: a typical three-tier architecture.
Content extraction class: implements the PageProcessor interface and holds the page-parsing logic.

package com.fuyao.demo.traffic.Utils.config;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * @author code4crafter@gmail.com <br>
 */
public class sinaBlogProcessor implements PageProcessor {

    // Site-level crawl settings: charset, delay between requests, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100).setCharset("utf-8");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Article page: matches https://voice.hupu.com/nba/<seven digits>.html
        if (page.getUrl().regex("https://voice\\.hupu\\.com/nba/[0-9]{7}\\.html").match()) {
            page.putField("Title", page.getHtml().xpath("/html/body/div[4]/div[1]/div[1]/h1/text()").toString());
            page.putField("Content", page.getHtml().xpath("/html/body/div[4]/div[1]/div[2]/div/div[2]/p/text()").all().toString());
            page.putField("imgUrl", page.getHtml().xpath("//div[@class='artical-importantPic']").css("img", "src").toString());
            page.putField("Publish", page.getHtml().xpath("/html/body/div[4]/div[1]/div[1]/div[1]/span/a/span/text()").toString());
        }
        // List page
        else {
            // Article URLs
            page.addTargetRequests(
                    page.getHtml().xpath("/html/body/div[3]/div[1]/div[2]/ul/li/div[1]/h4/a/@href").all());
            // Pagination URL
            page.addTargetRequests(
                    page.getHtml().xpath("/html/body/div[3]/div[1]/div[3]/a[@class='page-btn-prev']/@href").all());
        }
    }

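//    // Standalone entry point for a quick test; prints the results to the console
//    public static void main(String[] args) {
//        Spider.create(new sinaBlogProcessor()).addUrl("https://voice.hupu.com/nba/1")
//                .addPipeline(new us.codecraft.webmagic.pipeline.ConsolePipeline())
//                .thread(3).run();
//    }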
}
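
To check which URLs the processor would treat as article pages without starting a crawl, the same regular expression can be tried with plain java.util.regex. This is only a sketch: the class name ArticleUrlCheck and the sample URLs are made up for illustration, and Matcher.find() is used as a stand-in for webmagic's regex(...).match(), on the assumption that the two behave alike for this pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of webmagic: tries the article-page pattern on its own.
class ArticleUrlCheck {
    // Same pattern as in process(): https://voice.hupu.com/nba/<7 digits>.html
    private static final Pattern ARTICLE =
            Pattern.compile("https://voice\\.hupu\\.com/nba/[0-9]{7}\\.html");

    static boolean isArticlePage(String url) {
        return ARTICLE.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs: one article-shaped, one list-shaped
        System.out.println(isArticlePage("https://voice.hupu.com/nba/1234567.html")); // true
        System.out.println(isArticlePage("https://voice.hupu.com/nba/1"));            // false
    }
}
```

Anything that does not match falls into the else branch and is handled as a list page, so it is worth confirming that the start URL really is a list page.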

Content storage class: implements the Pipeline interface and persists the data to the database.

package com.fuyao.demo.traffic.Utils.config;

import com.fuyao.demo.traffic.bean.News;
import com.fuyao.demo.traffic.service.NewsService;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Iterator;
import java.util.Map;

/**
 * @description:
 * @author: fuyao
 * @time: 2020/11/12 10:36
 */
public class MysqlPipeline implements Pipeline {
    private NewsService newsService;

    public NewsService getNewsService() {
        return newsService;
    }

    public void setNewsService(NewsService newsService) {
        this.newsService = newsService;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        Map<String, Object> mapResults = resultItems.getAll();
        // Print every extracted field to the console
        for (Map.Entry<String, Object> entry : mapResults.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
        // List pages carry no fields, so there is nothing to persist
        Object title = mapResults.get("Title");
        if (title == null || "".equals(title)) {
            return;
        }
        // Persist the article
        News news = new News();
        news.setTitle((String) title);
        news.setContent((String) mapResults.get("Content"));
        news.setImg_url((String) mapResults.get("imgUrl"));
        news.setPunish_time((String) mapResults.get("Publish"));
        System.out.println(news.toString());
        try {
            newsService.add(news);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
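
Since list pages never call putField, the map a pipeline receives can be empty, and a lookup such as get("Title") may then return null. A defensive version of the mapping step might look like the sketch below. NewsRow and NewsMapping are hypothetical stand-ins for the project's News bean and the pipeline, written against a plain Map so the snippet runs without webmagic or Spring.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the project's News bean.
class NewsRow {
    final String title;
    final String content;

    NewsRow(String title, String content) {
        this.title = title;
        this.content = content;
    }
}

// Hypothetical helper: turns the extracted fields into a row, or nothing.
class NewsMapping {
    // Returns empty when there is no usable title, so the caller can skip the insert.
    static Optional<NewsRow> toNews(Map<String, Object> fields) {
        Object title = fields.get("Title");
        if (!(title instanceof String) || ((String) title).trim().isEmpty()) {
            return Optional.empty();
        }
        Object content = fields.get("Content");
        return Optional.of(new NewsRow((String) title, content == null ? "" : content.toString()));
    }

    public static void main(String[] args) {
        Map<String, Object> listPage = new HashMap<>();   // nothing was extracted
        Map<String, Object> article = new HashMap<>();
        article.put("Title", "Sample title");
        article.put("Content", "Sample body");
        System.out.println(toNews(listPage).isPresent()); // false
        System.out.println(toNews(article).isPresent());  // true
    }
}
```

Returning an Optional keeps the "is there anything to save" decision in one place, so the database call only happens for pages that actually produced an article.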

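Because I wanted to integrate this into a project I had already written, I put the start-up logic in the controller layer. It could just as well run from a scheduled job.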

package com.fuyao.demo.traffic.controller;

import com.fuyao.demo.traffic.Utils.config.MysqlPipeline;
import com.fuyao.demo.traffic.Utils.config.sinaBlogProcessor;
import com.fuyao.demo.traffic.service.CityService;
import com.fuyao.demo.traffic.service.NewsService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import us.codecraft.webmagic.Spider;

/**
 * @description:
 * @author: fuyao
 * @time: 2020/11/12 10:18
 */
@RestController
@RequestMapping("/News")
public class NewsController {
    @Autowired
    private NewsService service;
    @Autowired
    private CityService cityService;

    @GetMapping("/start")
    public void start() {
        // The pipeline is created with new, so the service is handed to it manually
        MysqlPipeline mysqlPipeline = new MysqlPipeline();
        mysqlPipeline.setNewsService(service);
        // run() blocks this request until the crawl has finished
        Spider.create(new sinaBlogProcessor())
                .addUrl("https://voice.hupu.com/nba/1")
                .addPipeline(mysqlPipeline)
                .thread(3)
                .run();
    }
}
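
The alternative mentioned above, running the crawl from a scheduled job rather than a controller, could look roughly like the following. This is a sketch using only the JDK's ScheduledExecutorService; CrawlScheduler is a hypothetical class, and the Runnable passed to start() stands in for the Spider.create(...).run() call. In a Spring Boot project the framework's own scheduling support would be the more usual choice.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler: repeats a crawl task on a single background thread.
class CrawlScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // crawlTask stands in for the spider run. The next run starts `delay` after
    // the previous one ends; a task that throws stops all further runs.
    void start(Runnable crawlTask, long delay, TimeUnit unit) {
        executor.scheduleWithFixedDelay(crawlTask, 0, delay, unit);
    }

    void stop() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch twoRuns = new CountDownLatch(2);
        CrawlScheduler scheduler = new CrawlScheduler();
        // A trivial task instead of a real crawl, repeated every 50 ms
        scheduler.start(twoRuns::countDown, 50, TimeUnit.MILLISECONDS);
        boolean ranTwice = twoRuns.await(5, TimeUnit.SECONDS);
        scheduler.stop();
        System.out.println("ran twice: " + ranTwice);
    }
}
```

A fixed delay rather than a fixed rate means two crawls never overlap, which matters when one crawl can take longer than the interval.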

Here is the application.properties file as well:

server.servlet.context-path=/traffic
#server.port=8989
# A DAO-layer framework is in use, so a data source must be configured
spring.datasource.username=your_username
spring.datasource.password=your_password
spring.datasource.url=jdbc:mysql://localhost:3306/traffic?useUnicode=true&characterEncoding=utf-8&useSSL=true&serverTimezone=UTC
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
#mybatis.mapper-locations= classpath:mapper/*.xml

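With that in place the project can be started; a GET request to /traffic/News/start kicks off the crawl. The original post showed a screenshot of the results after a run here.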

Original article: https://blog.csdn.net/qq_45036013/article/details/109643456