Writing a Stable Crawler with Spring Boot
1. Introduction
This article walks through building a stable crawler with Spring Boot. It covers three kinds of data: static HTML pages (no JavaScript executed), http/https API responses, and pages that require JavaScript rendering (which needs the Chrome browser). MySQL is used for storage. The program runs on a schedule: fetch the page data, parse it, and write it to MySQL. Baidu Gushitong (Baidu's stock-market site) serves as the example.
2. Creating the Project
Develop with IDEA. First create a Spring Boot project with Group set to com.crawler and Artifact set to example, as shown in Figure 1.
Check the Web module.
Set the project name to example.
3. Crawled Data and Table Structure
1. Crawl the summary data from Baidu Gushitong at https://gupiao.baidu.com/concept/
2. Fetch today's price data for a single stock
The summary data includes the hot concepts, their driving events, and the associated stock quotes. It maps to the following MySQL table:
CREATE TABLE `baidu_hot` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title_line1` varchar(255) DEFAULT NULL,
`title_line2` int(11) DEFAULT NULL,
`title_line3` varchar(255) DEFAULT NULL,
`title_line4` varchar(255) DEFAULT NULL,
`dirver_thing` text,
`hot_stock_name_1` varchar(255) DEFAULT NULL,
`hot_stock_code_1` varchar(11) DEFAULT NULL,
`hot_stock_price_1` double DEFAULT NULL,
`hot_stock_increment_1` varchar(20) DEFAULT NULL,
`hot_stock_name_2` varchar(255) DEFAULT NULL,
`hot_stock_code_2` varchar(11) DEFAULT NULL,
`hot_stock_price_2` double DEFAULT NULL,
`hot_stock_increment_2` varchar(20) DEFAULT NULL,
`hot_stock_name_3` varchar(255) DEFAULT NULL,
`hot_stock_code_3` varchar(11) DEFAULT NULL,
`hot_stock_price_3` double DEFAULT NULL,
`hot_stock_increment_3` varchar(20) DEFAULT NULL,
`insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
The Baidu stock API response is stored directly as JSON (the json column type requires MySQL 5.7 or later); the table is:
CREATE TABLE `stock` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`stock_id` varchar(30) DEFAULT NULL,
`data` json DEFAULT NULL,
`insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
The database is named baidugushi.
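For completeness, the database can be created before running the table DDL above (a minimal sketch; the charset matches the tables' utf8):
CREATE DATABASE baidugushi DEFAULT CHARACTER SET utf8;
USE baidugushi;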
4. Spring Boot Project Configuration
4.1. Logging Configuration
Logging uses logback, producing one info-level and one error-level log file per day. The configuration file is named logback-spring.xml; placing it under the resources directory is enough for it to take effect:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
<layout class="ch.qos.logback.classic.PatternLayout">
<pattern>%d - %msg%n</pattern>
</layout>
</appender>
<appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
<filter class="ch.qos.logback.classic.filter.LevelFilter">
<level>ERROR</level>
<onMatch>DENY</onMatch>
<onMismatch>ACCEPT</onMismatch>
</filter>
<encoder>
<pattern>%msg%n</pattern>
</encoder>
<!-- rolling policy -->
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- log path -->
<!-- Windows -->
<fileNamePattern>D:\logs\example.info.%d.log</fileNamePattern>
<!--<fileNamePattern>/data/log/crawler.info.%d.log</fileNamePattern>-->
</rollingPolicy>
</appender>
<appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>ERROR</level>
</filter>
<encoder>
<pattern>%msg%n</pattern>
</encoder>
<!-- rolling policy -->
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- log path -->
<fileNamePattern>D:\logs\example.error.%d.log</fileNamePattern>
<!--<fileNamePattern>/data/log/crawler.error.%d.log</fileNamePattern>-->
</rollingPolicy>
</appender>
<root level="info">
<appender-ref ref="consoleLog"/>
<appender-ref ref="fileInfoLog"/>
<appender-ref ref="fileErrorLog"/>
</root>
</configuration>
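With this configuration in place, any class obtains its logger through the SLF4J facade; a minimal sketch (CrawlerTask is a hypothetical example class):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CrawlerTask {
    private static final Logger logger = LoggerFactory.getLogger(CrawlerTask.class);

    public void run() {
        logger.info("fetch started");  // written to the console and the info log
        logger.error("fetch failed");  // also written to the error log
    }
}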
4.2、mysql配置
首先在项目的example包底下创建driver,entity,map,web包,如下图所示
mysql使用mybaits去连接数据库
配置之前首先需要在maven导入的包如下:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>6.0.6</version>
</dependency>
<dependency>
<groupId>org.mybatis.spring.boot</groupId>
<artifactId>mybatis-spring-boot-starter</artifactId>
<version>1.1.1</version>
</dependency>
The application.properties file contains:
mybatis.type-aliases-package=com.crawler.example.entity
spring.datasource.driverClassName = com.mysql.cj.jdbc.Driver
# local debugging
spring.datasource.url = jdbc:mysql://127.0.0.1:3306/baidugushi?useUnicode=true&characterEncoding=UTF-8&useSSL=false&autoReconnect=true&serverTimezone=UTC
spring.datasource.username = root
spring.datasource.password = pwd
mybatis.type-aliases-package points MyBatis at the package holding the entity classes that map to the tables; change the username and password to your own.
4.3. Tomcat Setup
The project is released as a war package; a jar is less convenient to deploy and manage here.
In the project's pom.xml, change the packaging to war:
<packaging>war</packaging>
Exclude the embedded Tomcat and add the necessary dependency:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.1.0</version>
<scope>provided</scope>
</dependency>
After editing pom.xml, add a servlet initializer class; ExampleApplication below is the startup class that Spring Boot generated:
package com.crawler.example;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;
public class SpringBootStartApplication extends SpringBootServletInitializer {
@Override
protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
// point this at the Application class normally run via its main method
return builder.sources(ExampleApplication.class);
}
}
4.4. Debugging the Project in IDEA
Tomcat must be installed on your machine. Open Run > Edit Configurations... in IDEA, delete the default Spring Boot run configuration, then create a new Tomcat configuration as shown in the two figures below. Untick the After launch checkbox, otherwise a browser window opens every time the project starts.
The first figure shows where to point IDEA at the Tomcat installation.
The second figure configures the war artifact deployed at startup.
5. Crawling Static Page Data
5.1. Writing the Scheduled Task
The crawler fetches pages on a timer. HTML parsing is done with jsoup, a Java HTML parser that can fetch a URL or parse raw HTML text directly, and it offers a very convenient API for extracting and manipulating data via DOM traversal, CSS selectors, and jQuery-like methods.
Add the following pom dependency:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
The page download program:
package com.crawler.example.web;
import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.crawler.example.map.BaiduHotMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;
import java.io.IOException;
import java.util.ArrayList;
//downloads the Baidu hot-concepts page
@RestController
public class BaiduHotDown {
private static final Logger logger = LoggerFactory.getLogger(BaiduHotDown.class);
@Autowired
BaiduHotMap baiduHotMap;
@Scheduled(cron = "0/20 * * * * ? ")
public void downBaiduHot(){
String url = "https://gupiao.baidu.com/concept/";
try {
Document doc = Jsoup.connect(url).get();
ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
for(BaiduHot b:abh){
baiduHotMap.InsertBaiduHot(b);
}
} catch (IOException e) {
logger.error("failed to download Baidu hot concepts", e);
}
}
}
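Since the goal is a stable crawler, the bare Jsoup.connect(url).get() call above can also be hardened with an explicit timeout and user agent; a sketch (the values are illustrative):
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36")
        .timeout(10000) // give up after 10 seconds instead of blocking the scheduler thread
        .get();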
5.2. Data Extraction and Parsing:
package com.crawler.example.dirver;
import com.crawler.example.entity.BaiduHot;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
public class BaiDuHotProcess {
public ArrayList<BaiduHot> processBaiduHot(Document doc){
ArrayList<BaiduHot> abh = new ArrayList<BaiduHot>();
//extract the data
Elements divsBig = doc.getElementsByClass("hot-concept clearfix");
for(int i=0;i<divsBig.size();i++){
BaiduHot baiduHot = new BaiduHot();
//the concept header column
Elements column1 = divsBig.get(i).getElementsByClass("concept-header column1");
//concept title
baiduHot.title_line1 = column1.get(0).getElementsByClass("text-ellipsis").get(0).ownText();
//hot-search index
baiduHot.title_line2 = Integer.parseInt(column1.get(0).getElementsByTag("h3").get(0).getElementsByTag("span").get(0).ownText());
//publish time
baiduHot.title_line3 = column1.get(0).getElementsByTag("p").get(0).ownText();
//brief summary
baiduHot.title_line4 = column1.get(0).getElementsByTag("p").get(1).ownText();
//driving-event summary
baiduHot.dirver_thing = divsBig.get(i).getElementsByClass("concept-event column3").get(0).ownText();
//the recommended stocks
Elements stockUl = divsBig.get(i).getElementsByClass("no-click");
//stock 1 name
baiduHot.hot_stock_name_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
//stock 1 code
baiduHot.hot_stock_code_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
//stock 1 price
baiduHot.hot_stock_price_1 = Double.parseDouble(stockUl.get(0).getElementsByClass("column2").get(1).ownText());
//stock 1 change
baiduHot.hot_stock_increment_1 = stockUl.get(0).child(2).ownText();
//stock 2 name
baiduHot.hot_stock_name_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
//stock 2 code
baiduHot.hot_stock_code_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
//stock 2 price
baiduHot.hot_stock_price_2 = Double.parseDouble(stockUl.get(1).getElementsByClass("column2").get(1).ownText());
//stock 2 change
baiduHot.hot_stock_increment_2 = stockUl.get(1).child(2).ownText();
//stock 3 name
baiduHot.hot_stock_name_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
//stock 3 code
baiduHot.hot_stock_code_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
//stock 3 price
baiduHot.hot_stock_price_3 = Double.parseDouble(stockUl.get(2).getElementsByClass("column2").get(1).ownText());
//stock 3 change
baiduHot.hot_stock_increment_3 = stockUl.get(2).child(2).ownText();
abh.add(baiduHot);
}
return abh;
}
}
The schedule period is expressed as a cron expression; the online tool at http://cron.qqe2.com/ lets you generate the expression you want with a few clicks.
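For reference, Spring's cron expressions have six fields: second, minute, hour, day-of-month, month, day-of-week. Two illustrative schedules (the method names are placeholders):
@Scheduled(cron = "0/20 * * * * ?") // every 20 seconds
public void everyTwentySeconds() { }

@Scheduled(cron = "0 0 9 * * ?") // every day at 09:00
public void everyMorningAtNine() { }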
The entity and database mapper classes follow.
package com.crawler.example.map;
import com.crawler.example.entity.BaiduHot;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;
//Baidu hot-concepts data
@Mapper
public interface BaiduHotMap {
@Insert("insert into baidu_hot(title_line1,title_line2,title_line3,title_line4,dirver_thing,hot_stock_name_1," +
"hot_stock_code_1,hot_stock_price_1,hot_stock_increment_1,hot_stock_name_2,hot_stock_code_2,hot_stock_price_2," +
"hot_stock_increment_2,hot_stock_name_3,hot_stock_code_3,hot_stock_price_3,hot_stock_increment_3) values(" +
"#{title_line1},#{title_line2},#{title_line3},#{title_line4},#{dirver_thing},#{hot_stock_name_1},#{hot_stock_code_1}," +
"#{hot_stock_price_1},#{hot_stock_increment_1},#{hot_stock_name_2},#{hot_stock_code_2},#{hot_stock_price_2},#{hot_stock_increment_2}," +
"#{hot_stock_name_3},#{hot_stock_code_3},#{hot_stock_price_3},#{hot_stock_increment_3})")
public void InsertBaiduHot(BaiduHot baiduHot);
}
The entity class:
package com.crawler.example.entity;
public class BaiduHot implements Cloneable {
public int id;
public String title_line1;
public int title_line2;
public String title_line3;
public String title_line4;
public String dirver_thing;
public String hot_stock_name_1;
public String hot_stock_code_1;
public double hot_stock_price_1;
public String hot_stock_increment_1;
public String hot_stock_name_2;
public String hot_stock_code_2;
public double hot_stock_price_2;
public String hot_stock_increment_2;
public String hot_stock_name_3;
public String hot_stock_code_3;
public double hot_stock_price_3;
public String hot_stock_increment_3;
}
Add the @EnableScheduling annotation to the startup class so that the @Scheduled methods actually run.
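A minimal sketch of the startup class with the annotation applied (assuming the generated ExampleApplication):
package com.crawler.example;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // activates the @Scheduled crawler tasks
public class ExampleApplication {
    public static void main(String[] args) {
        SpringApplication.run(ExampleApplication.class, args);
    }
}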
The data finally obtained is shown in the figure below:
6. Crawling HTTP API Data
6.1. Fetching Today's Data for a Single Stock
First add the org.json dependency for JSON handling:
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20160810</version>
</dependency>
The source code follows; the scheduled task downloads today's data for stock sh600358 (the @Scheduled line is commented out in the listing; uncomment it to enable the schedule):
package com.crawler.example.web;
import com.crawler.example.dirver.GetJson;
import com.crawler.example.entity.StockPrice;
import com.crawler.example.map.StockPriceMap;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;
//queries a stock's price
@RestController
public class BaiduStockPrice {
@Autowired
StockPriceMap stockPriceMap;
//downloads the stock timeline; uncomment the line below to enable the task
//@Scheduled(cron = "0/20 * * * * ? ")
public void downStockPrice(){
//build the url
String url = "https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=" + System.currentTimeMillis();
//fetch the json data
JSONObject stock = new GetJson().getHttpJson(url,1);
StockPrice stockPrice = new StockPrice();
stockPrice.stock_id = "sh600358";
stockPrice.data = stock.toString();
//store the json data in the database
stockPriceMap.insertIntoStock(stockPrice);
}
}
The HTTP download helper, which fetches a URL and returns the parsed JSON:
package com.crawler.example.dirver;
import org.json.JSONObject;
import javax.net.ssl.HttpsURLConnection;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
public class GetJson {
public JSONObject getHttpJson(String url,int comefrom){
try {
URL realUrl = new URL(url);
HttpURLConnection connection = (HttpURLConnection)realUrl.openConnection();
connection.setRequestProperty("accept", "*/*");
connection.setRequestProperty("connection", "Keep-Alive");
connection.setRequestProperty("user-agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
// establish the actual connection
connection.connect();
//request succeeded
if(connection.getResponseCode()==200){
InputStream is=connection.getInputStream();
ByteArrayOutputStream baos=new ByteArrayOutputStream();
//read in 4KB chunks
byte [] buffer=new byte[4096];
int len=0;
while((len=is.read(buffer))!=-1){
baos.write(buffer, 0, len);
}
String jsonString=baos.toString("utf-8");
baos.close();
is.close();
//parse into json
JSONObject jsonArray=getJsonString(jsonString,comefrom);
return jsonArray;
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException ex) {
ex.printStackTrace();
}
return null;
}
public JSONObject getHttpsJson(String url){
try {
URL realUrl = new URL(url);
HttpsURLConnection httpsConn = (HttpsURLConnection)realUrl.openConnection();
httpsConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
httpsConn.setRequestProperty("connection", "Keep-Alive");
httpsConn.setRequestProperty("user-agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");
httpsConn.setRequestProperty("Accept-Charset","utf-8");
httpsConn.setRequestProperty("contentType", "utf-8");
httpsConn.connect();
if(httpsConn.getResponseCode()==200){
InputStream is = httpsConn.getInputStream();
ByteArrayOutputStream baos=new ByteArrayOutputStream();
//read in 4KB chunks
byte [] buffer=new byte[4096];
int len=0;
while((len=is.read(buffer))!=-1){
baos.write(buffer, 0, len);
}
String jsonString=baos.toString("utf-8");
baos.close();
is.close();
return new JSONObject(jsonString);
}
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
public JSONObject getJsonString(String str, int comefrom){
if(comefrom==1){
//plain json
return new JSONObject(str);
}else if(comefrom==2){
//jsonp: strip the callback wrapper, e.g. cb({...})
int start = str.indexOf('(');
int end = str.lastIndexOf(')');
if(start >= 0 && end > start){
return new JSONObject(str.substring(start + 1, end));
}
}
return null;
}
}
The second argument of getHttpJson indicates the response format: 1 means the body is plain JSON; 2 means the API wraps the JSON in parentheses (JSONP), and the wrapper is stripped before parsing.
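A quick sketch of the two parse modes, calling getJsonString directly (the JSON strings are made-up examples):
GetJson getJson = new GetJson();
//comefrom = 1: the body is plain JSON
JSONObject plain = getJson.getJsonString("{\"price\":10.5}", 1);
//comefrom = 2: the body is JSONP, e.g. cb({...})
JSONObject wrapped = getJson.getJsonString("cb({\"price\":10.5})", 2);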
6.2. Entity Class and Database Mapper
The entity class:
package com.crawler.example.entity;
public class StockPrice {
public String stock_id;
public String data;
public String getStock_id() {
return stock_id;
}
public void setStock_id(String stock_id) {
this.stock_id = stock_id;
}
public String getData() {
return data;
}
public void setData(String data) {
this.data = data;
}
}
The mapper interface:
package com.crawler.example.map;
import com.crawler.example.entity.StockPrice;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;
@Mapper
public interface StockPriceMap {
@Insert("insert into stock(stock_id,data) values(#{stock_id},#{data})")
public void insertIntoStock(StockPrice stockPrice);
}
The final run result:
7. Crawling Dynamic Page Data
7.1. Crawling Pages That Require Rendering
Sometimes the data only exists after the page has been rendered by JavaScript, or the site has anti-crawler measures. In those cases a real browser can download the page for you. Such a crawler is easier to deploy on Windows Server, since installing Chrome on CentOS is comparatively painful; with this approach you can grab almost anything.
The Java library used here to drive Chrome is
cdp4j - Chrome DevTools Protocol for Java
available at:
https://github.com/webfolderio/cdp4j
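Its Maven coordinates are below (the version is indicative; check the project page for the current release):
<dependency>
<groupId>io.webfolder</groupId>
<artifactId>cdp4j</artifactId>
<version>3.0.8</version>
</dependency>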
We again crawl the Baidu hot-concepts page.
The crawl program:
package com.crawler.example.web;
import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;
import java.util.ArrayList;
@RestController
public class BaiduHotDownChorme {
@Autowired
BaiduHotMap baiduHotMap;
@Scheduled(cron = "0/20 * * * * ? ")
public void downBaiDuHot(){
ArrayList<String> command = new ArrayList<String>();
//run Chrome headless (no browser window)
command.add("--headless");
Launcher launcher = new Launcher();
try (SessionFactory factory = launcher.launch(command);
Session session = factory.create()){
session.navigate("https://gupiao.baidu.com/concept/");
session.waitDocumentReady();
String content = (String) session.getContent();
Document doc = Jsoup.parse(content);
ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
for(BaiduHot b:abh){
baiduHotMap.InsertBaiduHot(b);
}
}catch (Exception e){
e.printStackTrace();
}
}
}
Those are the three ways of crawling web page data in Java. The project source is on GitHub:
https://github.com/xiaoyangmoa/java-crawler