Java crawler framework WebMagic: study notes
程序员文章站
2022-04-07 18:05:32
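1. Preface
My graduation project needs a web crawler. Learning to build one in Python looked like it would take too long, so I searched online for existing Java crawler code instead. I'm recording it here so I can reuse it directly once I've forgotten the details.
2. Framework overview
WebMagic is a crawler framework that needs no configuration and is easy to extend. It offers a simple, flexible API, so a crawler can be written with only a small amount of code. WebMagic has a fully modular design that covers the whole crawler lifecycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, as well as automatic retries, custom User-Agent/cookies, and similar features. PS: I built a crawler with it myself and it really is convenient, and well suited to beginners.
I won't go into more detail; see the official site for the API documentation: http://webmagic.io/
After quite a bit of searching, though, the examples I found online were all very simple: just basic demos, none of them integrated with a current technology stack. So I did it myself and integrated WebMagic into a freshly built Spring Boot + MyBatis (annotation-based) project.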
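To make the four lifecycle stages listed above more concrete, here is a minimal plain-Java sketch of the loop a crawler framework runs: a queue of pending URLs, a download step, an extraction step that yields both fields and new links, and a persistence step. Everything in it (the class `CrawlLifecycleSketch`, the `FAKE_WEB` map, the `link:`/`title:` token format) is invented for illustration and is not WebMagic's API; the real framework does far more (threads, retries, real HTTP).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlLifecycleSketch {
    // Hypothetical in-memory "web": URL -> page body. Stands in for real HTTP downloads.
    static final Map<String, String> FAKE_WEB = new LinkedHashMap<>();
    static {
        FAKE_WEB.put("list", "link:a link:b");
        FAKE_WEB.put("a", "title:First");
        FAKE_WEB.put("b", "title:Second link:list"); // links back to an already-seen page
    }

    static List<String> crawl(String seedUrl) {
        Deque<String> queue = new ArrayDeque<>(); // URLs waiting to be downloaded
        Set<String> seen = new HashSet<>();       // de-duplication of discovered URLs
        List<String> stored = new ArrayList<>();  // "persistence": a list instead of a database
        queue.add(seedUrl);
        seen.add(seedUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = FAKE_WEB.getOrDefault(url, ""); // page download
            for (String token : body.split(" ")) {
                if (token.startsWith("link:")) {          // link extraction
                    String next = token.substring("link:".length());
                    if (seen.add(next)) {
                        queue.add(next);
                    }
                } else if (token.startsWith("title:")) {  // content extraction
                    stored.add(token.substring("title:".length()));
                }
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(crawl("list")); // prints [First, Second]
    }
}
```

In WebMagic these roles are split across separate components, which is why the rest of this post only has to supply a page processor and a pipeline.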
3. Preparation
pom.xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jdbc</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter</artifactId>
        <version>2.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis.generator</groupId>
        <artifactId>mybatis-generator-core</artifactId>
        <version>1.3.6</version>
    </dependency>
    <!-- Crawler (WebMagic) -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.73</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- mybatis-generator plugin for automatic code generation -->
        <plugin>
            <groupId>org.mybatis.generator</groupId>
            <artifactId>mybatis-generator-maven-plugin</artifactId>
            <version>1.3.6</version>
        </plugin>
    </plugins>
</build>
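Project structure: a typical three-layer architecture.
Content extraction class: implements the PageProcessor interface and contains the page-parsing logic.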
package com.fuyao.demo.traffic.Utils.config;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * @author code4crafter@gmail.com <br>
 */
public class sinaBlogProcessor implements PageProcessor {

    // Site-level crawl settings: charset, delay between requests, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100).setCharset("utf-8");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Article page: matches https://voice.hupu.com/nba/<seven digits>.html
        if (page.getUrl().regex("https://voice\\.hupu\\.com/nba/[0-9]{7}\\.html").match()) {
            page.putField("Title", page.getHtml().xpath("/html/body/div[4]/div[1]/div[1]/h1/text()").toString());
            // all().toString() stores the paragraphs in List.toString() form, e.g. "[p1, p2]"
            page.putField("Content", page.getHtml().xpath("/html/body/div[4]/div[1]/div[2]/div/div[2]/p/text()").all().toString());
            page.putField("imgUrl", page.getHtml().xpath("//div[@class='artical-importantPic']").css("img", "src").toString());
            page.putField("Publish", page.getHtml().xpath("/html/body/div[4]/div[1]/div[1]/div[1]/span/a/span/text()").toString());
        }
        // List page
        else {
            // Article URLs
            page.addTargetRequests(
                    page.getHtml().xpath("/html/body/div[3]/div[1]/div[2]/ul/li/div[1]/h4/a/@href").all());
            // Pagination URL
            page.addTargetRequests(
                    page.getHtml().xpath("/html/body/div[3]/div[1]/div[3]/a[@class='page-btn-prev']/@href").all());
        }
    }

    // Standalone test entry point (fully qualified names, so no extra imports are needed when uncommented):
    // public static void main(String[] args) {
    //     us.codecraft.webmagic.Spider.create(new sinaBlogProcessor()).addUrl("https://voice.hupu.com/nba/1")
    //             .addPipeline(new us.codecraft.webmagic.pipeline.ConsolePipeline())
    //             .thread(3).run();
    // }
}
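The processor decides between "article page" and "list page" purely from the URL pattern, so it can help to check that pattern in isolation before starting a crawl. The sketch below does this with the JDK's `java.util.regex` only; the class name `UrlClassifier` and the sample article URL are made up for illustration. Note that `matches()` requires the whole URL to match, which may be stricter than how WebMagic's own `regex(...).match()` treats partial matches.

```java
import java.util.regex.Pattern;

class UrlClassifier {
    // Same pattern as in the processor: .../nba/<seven digits>.html
    private static final Pattern ARTICLE =
            Pattern.compile("https://voice\\.hupu\\.com/nba/[0-9]{7}\\.html");

    /** Returns true if the whole URL has the article-page shape. */
    static boolean isArticle(String url) {
        return ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Hypothetical article URL, chosen only because it has seven digits
        System.out.println(isArticle("https://voice.hupu.com/nba/2661234.html")); // prints true
        // The seed URL used in this post is a list page
        System.out.println(isArticle("https://voice.hupu.com/nba/1"));            // prints false
    }
}
```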
Content persistence class: implements the Pipeline interface and saves the data to the database.
package com.fuyao.demo.traffic.Utils.config;

import com.fuyao.demo.traffic.bean.News;
import com.fuyao.demo.traffic.service.NewsService;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

/**
 * @description:
 * @author: fuyao
 * @time: 2020/11/12 10:36
 */
public class MysqlPipeline implements Pipeline {

    private NewsService newsService;

    public NewsService getNewsService() {
        return newsService;
    }

    public void setNewsService(NewsService newsService) {
        this.newsService = newsService;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        Map<String, Object> mapResults = resultItems.getAll();

        // Print every extracted field to the console
        for (Map.Entry<String, Object> entry : mapResults.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }

        // List pages put no fields, so "Title" may be null; skip them instead of saving an empty row
        Object title = mapResults.get("Title");
        if (title == null || "".equals(title)) {
            return;
        }

        // Persist
        News news = new News();
        news.setTitle((String) title);
        news.setContent((String) mapResults.get("Content"));
        news.setImg_url((String) mapResults.get("imgUrl"));
        news.setPunish_time((String) mapResults.get("Publish"));
        System.out.println(news.toString());
        try {
            newsService.add(news);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
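A pipeline reads its fields from a plain map, and in the processor above list pages never call `putField`, so a key such as "Title" can simply be absent. As far as I know WebMagic still hands such pages to the pipeline unless the page is explicitly marked to be skipped, so a small null-safe lookup is worth having. The helper below is a minimal sketch with an invented name (`ResultGuard.nonBlank`); it uses only the JDK and is not part of WebMagic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class ResultGuard {
    /** Returns the trimmed string stored under key, or empty if it is missing, not a String, or blank. */
    static Optional<String> nonBlank(Map<String, Object> fields, String key) {
        Object value = fields.get(key);
        if (!(value instanceof String)) {
            return Optional.empty();
        }
        String text = ((String) value).trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    public static void main(String[] args) {
        Map<String, Object> listPage = new HashMap<>();    // a list page: no fields at all
        Map<String, Object> articlePage = new HashMap<>();
        articlePage.put("Title", "  Some headline ");      // made-up sample value

        System.out.println(nonBlank(listPage, "Title").isPresent()); // prints false
        System.out.println(nonBlank(articlePage, "Title").get());    // prints Some headline
    }
}
```

With a helper like this, the pipeline can return early when the title is missing and only build a `News` object for real article pages.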
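Because I wanted to integrate this into a project skeleton I had written earlier, I put the startup logic in the controller layer; it could also be run from a scheduled job.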
package com.fuyao.demo.traffic.controller;

import com.fuyao.demo.traffic.Utils.config.MysqlPipeline;
import com.fuyao.demo.traffic.Utils.config.sinaBlogProcessor;
import com.fuyao.demo.traffic.service.CityService;
import com.fuyao.demo.traffic.service.NewsService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import us.codecraft.webmagic.Spider;

/**
 * @description:
 * @author: fuyao
 * @time: 2020/11/12 10:18
 */
@RestController
@RequestMapping("/News")
public class NewsController {

    @Autowired
    private NewsService service;

    @Autowired
    private CityService cityService;

    @GetMapping("/start")
    public void start() {
        // MysqlPipeline is created by hand, so the Spring-managed service is passed in explicitly
        MysqlPipeline mysqlPipeline = new MysqlPipeline();
        mysqlPipeline.setNewsService(service);
        // run() blocks until the crawl finishes, so this request does not return before then
        Spider.create(new sinaBlogProcessor()).addUrl("https://voice.hupu.com/nba/1").addPipeline(mysqlPipeline)
                .thread(3).run();
    }
}
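For the "scheduled job" alternative mentioned above, the sketch below shows the idea with the JDK's `ScheduledExecutorService`: the crawl is passed in as a `Runnable` and re-run with a fixed delay between runs. The class and method names (`ScheduledCrawlSketch`, `runScheduled`) are invented, and the run limit exists only so the demo terminates; in a Spring Boot project, Spring's own `@Scheduled` support would probably be the more natural choice.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ScheduledCrawlSketch {
    /** Runs crawlJob `runs` times with `delayMillis` between runs and returns the number of attempts. */
    static int runScheduled(Runnable crawlJob, long delayMillis, int runs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(runs);
        AtomicInteger attempts = new AtomicInteger();

        scheduler.scheduleWithFixedDelay(() -> {
            if (remaining.getCount() == 0) {
                return; // requested number of runs already reached
            }
            try {
                crawlJob.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log it and carry on
                e.printStackTrace();
            } finally {
                attempts.incrementAndGet();
                remaining.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);

        try {
            remaining.await(); // wait until all requested runs have happened
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        // The Runnable stands in for the Spider.create(...).run() call from the controller
        int n = runScheduled(() -> System.out.println("crawl started"), 50, 3);
        System.out.println(n); // prints 3
    }
}
```

A single-threaded scheduler with a fixed delay means a slow crawl never overlaps with the next one, which matters when each run writes to the same table.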
The application.properties file, for completeness:
server.servlet.context-path=/traffic
#server.port=8989
# A DAO-layer framework is used, so a data source must be configured
spring.datasource.username=<your database username>
spring.datasource.password=<your database password>
spring.datasource.url=jdbc:mysql://localhost:3306/traffic?useUnicode=true&characterEncoding=utf-8&useSSL=true&serverTimezone=UTC
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
#mybatis.mapper-locations= classpath:mapper/*.xml
After that the project is ready to run. (The original post showed a screenshot of the results here.)
Original article: https://blog.csdn.net/qq_45036013/article/details/109643456