
Writing a Stable Crawler with Spring Boot

程序员文章站 2022-03-09 22:55:33

1. Introduction

This article uses Spring Boot to build a stable crawler. The data it crawls covers three cases: static page data (HTML with no JavaScript executed), HTTP/HTTPS API responses, and page data rendered by JavaScript (this last case requires the Chrome browser). MySQL is used for storage. The program periodically fetches page data, parses it, and writes the result into the MySQL database; crawling Baidu Gushitong (百度股市通) data serves as the running example.

2. Creating the project

The project is developed in IDEA. First create a Spring Boot project with Group set to com.crawler and Artifact set to example, as shown in Figure 1.
(Figure 1: creating the Spring Boot project)

Check the Web module.

(Figure: selecting the Web module)
Set the project name to example.
(Figure: setting the project name)

3. Crawled data and table structure

1. Crawl the summary data from Baidu Gushitong at https://gupiao.baidu.com/concept/
2. Fetch today's price data for a single stock via the API:

https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=

The summary data includes the hot concept, the driving event, and the individual stock figures. It maps to the following MySQL table:

CREATE TABLE `baidu_hot` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title_line1` varchar(255) DEFAULT NULL,
  `title_line2` int(11) DEFAULT NULL,
  `title_line3` varchar(255) DEFAULT NULL,
  `title_line4` varchar(255) DEFAULT NULL,
  `dirver_thing` text,
  `hot_stock_name_1` varchar(255) DEFAULT NULL,
  `hot_stock_code_1` varchar(11) DEFAULT NULL,
  `hot_stock_price_1` double DEFAULT NULL,
  `hot_stock_increment_1` varchar(20) DEFAULT NULL,
  `hot_stock_name_2` varchar(255) DEFAULT NULL,
  `hot_stock_code_2` varchar(11) DEFAULT NULL,
  `hot_stock_price_2` double DEFAULT NULL,
  `hot_stock_increment_2` varchar(20) DEFAULT NULL,
  `hot_stock_name_3` varchar(255) DEFAULT NULL,
  `hot_stock_code_3` varchar(11) NOT NULL,
  `hot_stock_price_3` double DEFAULT NULL,
  `hot_stock_increment_3` varchar(20) DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;

The Baidu Gushitong API responses are stored directly as JSON; the table is:

CREATE TABLE `stock` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stock_id` varchar(30) DEFAULT NULL,
  `data` json DEFAULT NULL,
  `insert_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

The database is named baidugushi.

4. Spring Boot project configuration

4.1 Logging configuration
Logging uses Logback, generating one info-level and one error-level log file per day. The configuration file below, named logback-spring.xml, takes effect when placed under the resources directory:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d - %msg%n</pattern>
        </layout>
    </appender>
    <appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <onMatch>DENY</onMatch>
            <onMismatch>ACCEPT</onMismatch>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log path (Windows) -->
            <fileNamePattern>D:\logs\example.info.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.info.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <!-- rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- log path (Windows) -->
            <fileNamePattern>D:\logs\example.error.%d.log</fileNamePattern>
            <!-- Linux: <fileNamePattern>/data/log/crawler.error.%d.log</fileNamePattern> -->
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="consoleLog"/>
        <appender-ref ref="fileInfoLog"/>
        <appender-ref ref="fileErrorLog"/>
    </root>
</configuration>

4.2 MySQL configuration
First create the dirver, entity, map, and web packages under the project's example package (the spelling dirver matches the source code), as shown below.
(Figure: package layout under com.crawler.example)
MyBatis is used to connect to the MySQL database.
Before configuring, add the following Maven dependencies:

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>6.0.6</version>
        </dependency>

        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>1.1.1</version>
        </dependency>

The application.properties file contains:

mybatis.type-aliases-package=com.crawler.example.entity
spring.datasource.driverClassName = com.mysql.cj.jdbc.Driver
#本机调试
spring.datasource.url = jdbc:mysql://127.0.0.1:3306/baidugushi?useUnicode=true&characterEncoding=UTF-8&useSSL=false&autoReconnect=true&serverTimezone=UTC
spring.datasource.username = root
spring.datasource.password = pwd

mybatis.type-aliases-package points MyBatis at the package holding the entity classes that correspond to the database tables; change the username and password to your own.
4.3 Tomcat setup
The project is released as a war package, since a jar is less convenient to deploy and manage here.
In pom.xml, change the packaging to war:

    <packaging>war</packaging>

Exclude the embedded Tomcat and add the necessary dependencies:

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
            <scope>provided</scope>
        </dependency>

After modifying pom.xml, a startup class still needs to be added; ExampleApplication below is the startup class Spring Boot generated:

package com.crawler.example;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.support.SpringBootServletInitializer;

public class SpringBootStartApplication extends SpringBootServletInitializer {
    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        // Note: point this at the original Application class started via main
        return builder.sources(ExampleApplication.class);
    }
}

4.4 IDEA debugging setup
Tomcat must be installed locally. In IDEA open Run – Edit Configurations..., delete the default Spring Boot run configuration, then create a Tomcat configuration as shown in the two figures below. Uncheck the After launch option, otherwise a browser window opens every time the project starts.
(Figure: Tomcat run configuration; the main thing to set is the Tomcat location)
(Figure: Deployment tab, showing the war package deployed at startup)

5. Crawling static page data

5.1 Writing the scheduled task
The crawler fetches pages on a schedule. HTML parsing uses jsoup, a Java HTML parser that can directly parse a URL or raw HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Add the following pom dependency:

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

The page download program:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.crawler.example.map.BaiduHotMap;
import org.slf4j.Logger;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;
import java.io.IOException;
import java.util.ArrayList;

// Downloads the Baidu Gushitong hot-concept page
@RestController
public class BaiduHotDown {
    private static final Logger logger = org.slf4j.LoggerFactory.getLogger(BaiduHotDown.class);
    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiduHot(){
        String url = "https://gupiao.baidu.com/concept/";
        try {
            Document doc = Jsoup.connect(url).get();
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for(BaiduHot b:abh){
                baiduHotMap.InsertBaiduHot(b);
            }
        } catch (IOException e) {
            logger.error("failed to download " + url, e);
        }
    }
}

5.2 The data extraction and parsing code:

package com.crawler.example.dirver;

import com.crawler.example.entity.BaiduHot;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;

public class BaiDuHotProcess {

    public ArrayList<BaiduHot> processBaiduHot(Document doc){

        ArrayList<BaiduHot> abh = new ArrayList<BaiduHot>();
        // extract the hot-concept rows
        Elements divsBig = doc.getElementsByClass("hot-concept clearfix");
        for(int i=0;i<divsBig.size();i++){
            BaiduHot baiduHot = new BaiduHot();
            // header column of the concept
            Elements column1 = divsBig.get(i).getElementsByClass("concept-header column1");
            // concept title
            baiduHot.title_line1 = column1.get(0).getElementsByClass("text-ellipsis").get(0).ownText();
            // hot-search index
            baiduHot.title_line2 = Integer.parseInt(column1.get(0).getElementsByTag("h3").get(0).getElementsByTag("span").get(0).ownText());
            // publication time
            baiduHot.title_line3 = column1.get(0).getElementsByTag("p").get(0).ownText();
            // brief description
            baiduHot.title_line4 = column1.get(0).getElementsByTag("p").get(1).ownText();
            // driving-event summary
            baiduHot.dirver_thing = divsBig.get(i).getElementsByClass("concept-event column3").get(0).ownText();
            // recommended stocks
            Elements stockUl = divsBig.get(i).getElementsByClass("no-click");
            // stock 1: name, code, price, change
            baiduHot.hot_stock_name_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_1 = stockUl.get(0).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_1 = Double.parseDouble(stockUl.get(0).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_1 = stockUl.get(0).child(2).ownText();
            // stock 2: name, code, price, change
            baiduHot.hot_stock_name_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_2 = stockUl.get(1).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_2 = Double.parseDouble(stockUl.get(1).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_2 = stockUl.get(1).child(2).ownText();
            // stock 3: name, code, price, change
            baiduHot.hot_stock_name_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(0).ownText();
            baiduHot.hot_stock_code_3 = stockUl.get(2).getElementsByTag("a").get(0).getElementsByTag("div").get(1).ownText();
            baiduHot.hot_stock_price_3 = Double.parseDouble(stockUl.get(2).getElementsByClass("column2").get(1).ownText());
            baiduHot.hot_stock_increment_3 = stockUl.get(2).child(2).ownText();

            abh.add(baiduHot);
        }
        return abh;
    }
}
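One caveat on the parsing above: Double.parseDouble throws NumberFormatException when the page shows a non-numeric placeholder (for example "--" for a suspended stock), which would abort the whole batch. A defensive helper could be sketched as the standalone class below (SafeParse and parsePrice are illustrative names, not part of the project):

```java
public class SafeParse {

    // Parse a price string, falling back to a default when the page
    // shows a non-numeric placeholder such as "--" or an empty string.
    public static double parsePrice(String text, double fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Double.parseDouble(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("12.34", 0.0)); // 12.34
        System.out.println(parsePrice("--", 0.0));    // 0.0
    }
}
```

With this, one unparsable stock row degrades to a fallback value instead of losing the whole concept list.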

The crawl schedule is defined by a cron expression; the online tool at http://cron.qqe2.com/ can generate the expression you want with a few clicks.
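For reference, the expression 0/20 * * * * ? used in the @Scheduled annotations has six fields (second, minute, hour, day-of-month, month, day-of-week) and fires whenever the second-of-minute is 0, 20, or 40. A rough plain-Java sketch of the same 20-second cadence without Spring, using the JDK's ScheduledExecutorService (FixedRateCrawler and firesAtSecond are illustrative names, not from the project):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedRateCrawler {

    // "0/20 * * * * ?" matches seconds 0, 20 and 40 of every minute.
    public static boolean firesAtSecond(int secondOfMinute) {
        return secondOfMinute % 20 == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // run the crawl task every 20 seconds, starting immediately
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("crawl tick"),
                0, 20, TimeUnit.SECONDS);
        Thread.sleep(200);       // let the first tick fire in this demo
        scheduler.shutdownNow(); // a real crawler would keep running
    }
}
```

Spring's cron support is richer (calendar-aware fields, the ? wildcard); the executor above only reproduces the fixed 20-second rate.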
The entity and database interface classes are as follows.

package com.crawler.example.map;

import com.crawler.example.entity.BaiduHot;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

// Mapper for the Baidu hot-concept data
@Mapper
public interface BaiduHotMap {
    @Insert("insert into baidu_hot(title_line1,title_line2,title_line3,title_line4,dirver_thing,hot_stock_name_1," +
            "hot_stock_code_1,hot_stock_price_1,hot_stock_increment_1,hot_stock_name_2,hot_stock_code_2,hot_stock_price_2," +
            "hot_stock_increment_2,hot_stock_name_3,hot_stock_code_3,hot_stock_price_3,hot_stock_increment_3) values(" +
            "#{title_line1},#{title_line2},#{title_line3},#{title_line4},#{dirver_thing},#{hot_stock_name_1},#{hot_stock_code_1}," +
            "#{hot_stock_price_1},#{hot_stock_increment_1},#{hot_stock_name_2},#{hot_stock_code_2},#{hot_stock_price_2},#{hot_stock_increment_2}," +
            "#{hot_stock_name_3},#{hot_stock_code_3},#{hot_stock_price_3},#{hot_stock_increment_3})")
    public void InsertBaiduHot(BaiduHot baiduHot);
}

The entity class:

package com.crawler.example.entity;

public class BaiduHot implements Cloneable {

    public int id;

    public String title_line1;

    public int title_line2;

    public String title_line3;

    public String title_line4;

    public String dirver_thing;

    public String hot_stock_name_1;

    public String hot_stock_code_1;

    public double hot_stock_price_1;

    public String hot_stock_increment_1;

    public String hot_stock_name_2;

    public String hot_stock_code_2;

    public double hot_stock_price_2;

    public String hot_stock_increment_2;

    public String hot_stock_name_3;

    public String hot_stock_code_3;

    public double hot_stock_price_3;

    public String hot_stock_increment_3;
}

Add the @EnableScheduling annotation to the startup class.
The captured data looks like this:
(Figure: rows captured in the baidu_hot table)

6. Crawling HTTP API data

6.1 Fetching today's data for one stock
First add the org.json dependency:

        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20160810</version>
        </dependency>

The source code follows; when the scheduler is enabled it downloads today's price data for stock sh600358:

package com.crawler.example.web;

import com.crawler.example.dirver.GetJson;
import com.crawler.example.entity.StockPrice;
import com.crawler.example.map.StockPriceMap;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

// Queries the price of a stock
@RestController
public class BaiduStockPrice {

    @Autowired
    StockPriceMap stockPriceMap;

    // download today's price curve data
    //@Scheduled(cron = "0/20 * * * * ? ")
    public void downStockPrice(){
        // build the URL
        String url = "https://gupiao.baidu.com/api/stocks/stocktimeline?from=pc&os_ver=1&cuid=xxx&vv=100&format=json&stock_code=sh600358&timestamp=" + System.currentTimeMillis();
        // fetch the JSON response
        JSONObject stock = new GetJson().getHttpJson(url,1);
        StockPrice stockPrice = new StockPrice();
        stockPrice.stock_id = "sh600358";
        stockPrice.data = stock.toString();
        // store the JSON in the database
        stockPriceMap.insertIntoStock(stockPrice);
    }
}

The HTTP download helper, returning JSON:

package com.crawler.example.dirver;

import org.json.JSONObject;

import javax.net.ssl.HttpsURLConnection;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {
    public JSONObject getHttpJson(String url,int comefrom){
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection)realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // open the actual connection
            connection.connect();
            // request succeeded
            if(connection.getResponseCode()==200){
                InputStream is=connection.getInputStream();
                ByteArrayOutputStream baos=new ByteArrayOutputStream();
                // read buffer
                byte [] buffer=new byte[4096];
                int len=0;
                while((len=is.read(buffer))!=-1){
                    baos.write(buffer, 0, len);
                }
                String jsonString=baos.toString("utf-8");
                baos.close();
                is.close();
                // parse into a JSONObject
                JSONObject jsonObject=getJsonString(jsonString,comefrom);
                return jsonObject;
            }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
           ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getHttpsJson(String url){
        try {
            URL realUrl = new URL(url);
            HttpsURLConnection httpsConn = (HttpsURLConnection)realUrl.openConnection();
            httpsConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            httpsConn.setRequestProperty("connection", "Keep-Alive");
            httpsConn.setRequestProperty("user-agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");
            httpsConn.setRequestProperty("Accept-Charset","utf-8");
            httpsConn.setRequestProperty("contentType", "utf-8");
            httpsConn.connect();
            if(httpsConn.getResponseCode()==200){
                InputStream is = httpsConn.getInputStream();
                ByteArrayOutputStream baos=new ByteArrayOutputStream();
                // read buffer
                byte [] buffer=new byte[4096];
                int len=0;
                while((len=is.read(buffer))!=-1){
                    baos.write(buffer, 0, len);
                }
                String jsonString=baos.toString("utf-8");
                baos.close();
                is.close();
                return new JSONObject(jsonString);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return  null;
    }

    public JSONObject getJsonString(String str, int comefrom){
        if(comefrom==1){
            // the body is plain JSON
            return new JSONObject(str);
        }else if(comefrom==2){
            // JSONP: strip the callback wrapper, e.g. cb({...}) -> {...}
            int indexStart = str.indexOf('(');
            String strNew = str.substring(indexStart + 1, str.length() - 1);
            return new JSONObject(strNew);
        }
        return null;
    }
}
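As a side note, on Java 9 and later the hand-rolled buffer loop in getHttpJson and getHttpsJson can be replaced by InputStream.readAllBytes. A standalone sketch (StreamRead and readBody are illustrative names, not part of the project):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamRead {

    // Read an entire stream into a UTF-8 string (Java 9+).
    public static String readBody(InputStream is) {
        try {
            return new String(is.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream demo = new ByteArrayInputStream(
                "{\"ok\":true}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(demo)); // {"ok":true}
    }
}
```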

The second argument of getHttpJson selects the parsing mode: 1 means the response body is plain JSON; 2 means the JSON payload is wrapped in parentheses, as in a JSONP response.
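The comefrom == 2 case strips a JSONP-style wrapper such as cb({...}). That unwrapping can be exercised on its own as a plain string operation (Jsonp and unwrapJsonp are illustrative names, not part of the project):

```java
public class Jsonp {

    // Strip a JSONP wrapper such as cb({"a":1}) down to {"a":1}.
    // Assumes the payload ends with the closing ')' of the wrapper.
    public static String unwrapJsonp(String s) {
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return s; // not wrapped; return unchanged
        }
        return s.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(unwrapJsonp("cb({\"a\":1})")); // {"a":1}
    }
}
```

Unlike the in-class version, this sketch also passes unwrapped JSON through unchanged.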
6.2 The database entity class and database interface
The entity class:

package com.crawler.example.entity;


public class StockPrice {

    public  String stock_id;

    public String data;

    public String getStock_id() {
        return stock_id;
    }

    public void setStock_id(String stock_id) {
        this.stock_id = stock_id;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }
}

The database interface:

package com.crawler.example.map;

import com.crawler.example.entity.StockPrice;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface StockPriceMap {

    @Insert("insert into stock(stock_id,data) values(#{stock_id},#{data})")
    public void insertIntoStock(StockPrice stockPrice);

}

The result of a run:
(Figure: JSON rows stored in the stock table)

7. Crawling dynamic page data

7.1 Crawling pages that must be rendered

Sometimes page data only exists after rendering, or the site has anti-crawler measures; in those cases a real browser can be used to download the page. Such a crawler is more convenient to deploy on Windows Server, since installing Chrome on CentOS is comparatively difficult. With this approach almost anything visible in a browser can be captured.
The Java library used here to drive Chrome is:

cdp4j - Chrome DevTools Protocol for Java
Project page:
https://github.com/webfolderio/cdp4j

The target is again the Baidu Gushitong hot-concept page. The crawler:

package com.crawler.example.web;

import com.crawler.example.dirver.BaiDuHotProcess;
import com.crawler.example.entity.BaiduHot;
import com.crawler.example.map.BaiduHotMap;
import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;

@RestController
public class BaiduHotDownChorme {

    @Autowired
    BaiduHotMap baiduHotMap;

    @Scheduled(cron = "0/20 * * * * ? ")
    public void downBaiDuHot(){
        ArrayList<String> command = new ArrayList<String>();
        // run Chrome without showing a window
        command.add("--headless");
        Launcher launcher = new Launcher();
        try (SessionFactory factory = launcher.launch(command);
             Session session = factory.create()){
            session.navigate("https://gupiao.baidu.com/concept/");
            session.waitDocumentReady();
            String content = (String) session.getContent();
            Document doc = Jsoup.parse(content);
            ArrayList<BaiduHot> abh = new BaiDuHotProcess().processBaiduHot(doc);
            for(BaiduHot b:abh){
                baiduHotMap.InsertBaiduHot(b);
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }
}

The above covers the three ways of crawling web page data in Java. The project source is on GitHub:
https://github.com/xiaoyangmoa/java-crawler