Using the Spring Boot + WebMagic + MyBatis Crawler Framework
程序员文章站
2022-03-08 15:27:39
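WebMagic is an open-source Java crawler framework. How to use WebMagic itself is not the focus of this article; see the official documentation for details.
This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database. The source code provided here can serve as scaffolding for a Java crawler project.
1. Add the Maven dependencies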
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>
2. Project configuration file application.properties
Configure the MySQL data source, the Druid database connection pool, and the location of the MyBatis mapper files.
# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=select 1 from dual
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis configuration
mybatis.mapperLocations=classpath:mapper/**/*.xml
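The Druid timing settings above are all given in milliseconds, which is easy to misread. The following is a minimal sketch, using only the JDK, that loads those four values and converts them to `java.time.Duration`. The class name `PoolTimings` and its helper methods are made up for illustration and are not part of Druid or Spring Boot.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Druid or Spring Boot.
class PoolTimings {

    // The four millisecond-valued settings copied from application.properties.
    static Properties sampleConfig() {
        String text = String.join("\n",
                "spring.datasource.druid.max-wait=60000",
                "spring.datasource.druid.time-between-eviction-runs-millis=60000",
                "spring.datasource.druid.min-evictable-idle-time-millis=300000",
                "spring.datasource.druid.max-evictable-idle-time-millis=600000");
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return props;
    }

    // Reads one millisecond-valued property as a Duration.
    static Duration millis(Properties props, String key) {
        return Duration.ofMillis(Long.parseLong(props.getProperty(key).trim()));
    }

    public static void main(String[] args) {
        Properties props = sampleConfig();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + millis(props, key).toMinutes() + " min");
        }
    }
}
```

Read this way, the pool waits up to 1 minute for a connection, checks idle connections every minute, and uses 5 and 10 minutes as the lower and upper idle-time thresholds.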
3. Database table structure
CREATE TABLE `cms_content` (
  `contentid` VARCHAR(40) NOT NULL COMMENT 'content ID',
  `title` VARCHAR(150) NOT NULL COMMENT 'title',
  `content` LONGTEXT COMMENT 'article content',
  `releasedate` DATETIME NOT NULL COMMENT 'release date',
  PRIMARY KEY (`contentid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';
4. Entity class
import java.util.Date;

public class CmsContentPO {
    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}
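5. Mapper interface
import com.hyzx.qbasic.model.CmsContentPO;

public interface CrawlerMapper {
    // Inserts one row into cms_content and returns the number of affected rows.
    int addCmsContent(CmsContentPO record);
}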
6. CrawlerMapper.xml file
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentid, title, releasedate, content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>
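7. Zhihu page processor class ZhihuPageProcessor
This class parses the Zhihu HTML pages that the crawler downloads.
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class ZhihuPageProcessor implements PageProcessor {
    // Retry failed downloads 3 times and wait 1 second between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every answer link found on the page for crawling.
        page.addTargetRequests(page.getHtml().links()
                .regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        page.putField("title",
                page.getHtml().xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer",
                page.getHtml().xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // A list page has no title: skip it so the pipeline does not process it.
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}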
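To see which links the processor's regular expression keeps, the expression can be tried on its own with `java.util.regex`. This is a minimal sketch with made-up sample URLs. The class `AnswerLinkFilter` is hypothetical, and it uses a plain full-string match as an approximation; WebMagic's own `regex(...)` selector may treat partial matches differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-alone check; approximates links().regex(...) with a full match.
class AnswerLinkFilter {

    // The same expression that ZhihuPageProcessor passes to regex(...).
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*");

    // Keeps only the links that look like a single answer page.
    static List<String> keepAnswerLinks(List<String> links) {
        return links.stream()
                .filter(link -> ANSWER_URL.matcher(link).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample URLs invented for illustration.
        List<String> links = Arrays.asList(
                "https://www.zhihu.com/question/123/answer/456",
                "https://www.zhihu.com/question/123",
                "https://www.zhihu.com/explore");
        System.out.println(keepAnswerLinks(links));
    }
}
```

Only the first sample URL has both a question ID and an answer ID, so question pages and the explore page are not queued as target requests by this pattern.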
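8. Zhihu data pipeline class ZhihuPipeline
This class stores the data parsed from Zhihu HTML pages in the MySQL database.
import java.util.Date;
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class ZhihuPipeline implements Pipeline {
    private static final Logger logger = LoggerFactory.getLogger(ZhihuPipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");

        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);

        try {
            // Report success only when the insert actually affected a row.
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                logger.info("Saved Zhihu article: {}", title);
            } else {
                logger.warn("Zhihu article was not saved: {}", title);
            }
        } catch (Exception ex) {
            logger.error("Failed to save Zhihu article", ex);
        }
    }
}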
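Because the pipeline assigns a fresh random UUID to every record, an answer that is crawled again on a later run is stored as a new row. The sketch below shows that the random ID is 36 characters long, so it fits the VARCHAR(40) primary key, and contrasts it with a name-based UUID derived from the page URL. The URL-based variant is an assumption added for illustration, not something the original code does, and the class `ContentIds` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration only.
class ContentIds {

    // Same approach as ZhihuPipeline: a different value on every call.
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    // Assumed alternative: the same URL always yields the same ID.
    static String idForUrl(String url) {
        return UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String url = "https://www.zhihu.com/question/123/answer/456"; // sample URL
        System.out.println(randomId().length());
        System.out.println(idForUrl(url).equals(idForUrl(url)));
    }
}
```

With a stable ID, the primary key on `contentid` would reject a second insert of the same page, and the existing catch block in the pipeline would log that failure instead of storing a duplicate.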
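9. Zhihu crawler task class ZhihuTask
This class starts the crawler once every ten minutes.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

@Component
public class ZhihuTask {
    private static final Logger logger = LoggerFactory.getLogger(ZhihuTask.class);

    @Autowired
    private ZhihuPipeline zhihuPipeline;

    @Autowired
    private ZhihuPageProcessor zhihuPageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes.
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("zhihuCrawlerThread");

            try {
                Spider.create(zhihuPageProcessor)
                        // Start crawling from https://www.zhihu.com/explore
                        .addUrl("https://www.zhihu.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(zhihuPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                logger.error("Scheduled Zhihu crawl failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}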
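The try/catch inside the scheduled task matters: if a task run by `scheduleWithFixedDelay` throws, the executor suppresses all later runs. The sketch below, which uses only the JDK, shortens the delay to milliseconds and simulates a failure on the first run to show that later runs still happen when the exception is caught. The class `FixedDelayDemo` and the simulated failure are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of the scheduling pattern used in ZhihuTask.
class FixedDelayDemo {

    // Schedules a task, waits for at least expectedRuns executions, returns the run count.
    static int countRuns(int expectedRuns, long delayMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(expectedRuns);

        timer.scheduleWithFixedDelay(() -> {
            try {
                if (runs.incrementAndGet() == 1) {
                    // Stand-in for a crawl that fails.
                    throw new IllegalStateException("simulated crawl failure");
                }
            } catch (Exception ex) {
                // Catching here keeps the schedule alive for the next run.
                System.err.println("run failed: " + ex.getMessage());
            } finally {
                done.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);

        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("runs completed: " + countRuns(3, 20));
    }
}
```

With a fixed delay, the 10-minute interval in `ZhihuTask` is measured from the moment the previous task body returns, not from when it started.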
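10. Spring Boot application startup class
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private ZhihuTask zhihuTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Start crawling Zhihu once the application context is ready.
        zhihuTask.crawl();
    }
}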