A Detailed Method for Batch-Crawling WeChat Official Accounts with Java

程序员文章站 2023-12-04 11:46:04

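Recently I needed to crawl article data from WeChat official accounts. After some searching I found that the main difficulty is that official-account article links cannot be opened on a PC. They have to be opened in WeChat's built-in browser (a link only opens on other platforms once it carries the extra parameters that the WeChat client appends), which is a real obstacle for a crawler. Later I came across a WeChat official-account crawler written in PHP by an expert on Zhihu, so I followed that approach and ported it to Java. I ran into quite a few small problems along the way, and I am sharing them here.

The basic idea is to run WeChat in an Android emulator, configure a proxy on the emulator, intercept WeChat's traffic with the proxy server, and forward the captured data to your own program for processing.

Required environment: Node.js, the AnyProxy proxy, and an Android emulator.

Node.js download: http://nodejs.cn/download/. I downloaded the Windows version; just run the installer. After installation, running C:\Program Files\nodejs\npm.cmd sets up the environment automatically.

Installing AnyProxy: once Node.js is installed as described above, run npm install -g anyproxy in cmd.

Any Android emulator will do; there are plenty available online.

First install a certificate for the proxy server. By default AnyProxy does not decrypt HTTPS traffic; once the certificate is installed it can. Running anyproxy --root in cmd installs the certificate. The same certificate also has to be downloaded in the emulator later.

Then start the proxy with the command anyproxy -i. (Do not forget the flag: -i turns on HTTPS interception.)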

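Note the IP and port that AnyProxy reports when it starts; the Android emulator's proxy will use them. Now open http://localhost:8002/ in a browser. This is AnyProxy's web interface, which displays the captured HTTP traffic.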

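In the web interface, click the menu item that brings up a QR code (it was marked with a red box in the original screenshot). Scan the code with the Android emulator and the emulator (phone) will download the certificate; then install it.

Now set the emulator's proxy: choose manual proxy mode, set the proxy IP to the IP of the machine running AnyProxy, and set the port to 8001.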

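The preparation is now basically complete. Open WeChat in the emulator and open any official-account article; the data captured by AnyProxy will appear in the web interface you just opened.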

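The WeChat article link shows up in the request list (marked with a red box in the original screenshot); click it to see the details. If the response body is empty, the certificate was probably not installed correctly.

If all of the above works, you can continue.

Here we rely on the proxy to capture WeChat data, but we cannot operate WeChat by hand for every single item; that would be no better than copying the data manually. So the WeChat client has to navigate between pages by itself. To achieve that, use AnyProxy to intercept the data returned by the WeChat server, inject page-redirect code into it, and return the modified data to the emulator, so that the WeChat client jumps to the next page automatically.

Open the AnyProxy file rule_default.js. On Windows it is located in: C:\Users\Administrator\AppData\Roaming\npm\node_modules\anyproxy\lib

The file contains a method replaceServerResDataAsync: function(req,res,serverResData,callback), which is responsible for manipulating the data AnyProxy receives. Initially it should contain only callback(serverResData); which returns the server response to the client unchanged. Delete that statement and replace it with the expert's code below. I made no changes to this code, and its comments explain it clearly; just follow the logic.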

 replaceServerResDataAsync: function(req,res,serverResData,callback){
     if(/mp\/getmasssendmsg/i.test(req.url)){// the URL is an official-account history page (first page form)
       //console.log("start crawling the first page form");
       if(serverResData.toString() !== ""){
         try {// keep an exception from terminating the proxy
           var reg = /msgList = (.*?);/;// regex that extracts the history-message list
           var ret = reg.exec(serverResData.toString());// convert the response to a string and match it
           httpPost(ret[1],req.url,"/internetspider/getdata/showbiz");// defined later in this post; sends the matched history JSON to our own server
           var http = require('http');
           http.get('http://xxx/getwxhis', function(res) {// an endpoint on our own server that returns the next URL wrapped in a JS snippet, so the page redirects automatically; its logic (getwxhis.php in the original PHP version) is described later
             res.on('data', function(chunk){
             callback(chunk+serverResData);// prepend the returned snippet to the history page and return it for display
             })
           });
         }catch(e){// if the regex above did not match, this is probably the second page reached by scrolling down: the first history page is HTML, the following pages are JSON
         //console.log("first page form, scrolled-down page");
           try {
             var json = JSON.parse(serverResData.toString());
             if (json.general_msg_list != []) {
             httpPost(json.general_msg_list,req.url,"/xxx/showbiz");// same function as above; sends the second-page history JSON to our own server
             }
           }catch(e){
            console.log(e);// log the error
           }
           callback(serverResData);// return the second-page JSON unchanged
         }
       }
       //console.log("finished crawling the first page form");
     }else if(/mp\/profile_ext\?action=home/i.test(req.url)){// the URL is an official-account history page (second page form)
       try {
         var reg = /var msgList = \'(.*?)\';/;// regex that extracts the history-message list (differs from the first page form)
         var ret = reg.exec(serverResData.toString());// convert the response to a string and match it
         httpPost(ret[1],req.url,"/xxx/showbiz");// defined later in this post; sends the matched history JSON to our own server
         var http = require('http');
         http.get('http://xxx/getwxhis', function(res) {// an endpoint on our own server that returns the next URL wrapped in a JS snippet, so the page redirects automatically; its logic (getwxhis.php in the original PHP version) is described later
             res.on('data', function(chunk){
             callback(chunk+serverResData);// prepend the returned snippet to the history page and return it for display
             })
           });
       }catch(e){
         //console.log(e);
         callback(serverResData);
       }
     }else if(/mp\/profile_ext\?action=getmsg/i.test(req.url)){// JSON returned when scrolling down in the second page form
       try {
         var json = JSON.parse(serverResData.toString());
         if (json.general_msg_list != []) {
           httpPost(json.general_msg_list,req.url,"/xxx/showbiz");// same function as above; sends the second-page history JSON to our own server
         }
       }catch(e){
         console.log(e);
       }
       callback(serverResData);
     }else if(/mp\/getappmsgext/i.test(req.url)){// the URL returns an article's read count and like count
       try {
         httpPost(serverResData,req.url,"/xxx/getmsgext");// defined later in this post; sends the read-count and like-count JSON to our server
       }catch(e){
         // ignore forwarding errors
       }
       callback(serverResData);
     }else if(/s\?__biz/i.test(req.url) || /mp\/rumor/i.test(req.url)){// the URL is an official-account article (the rumor URL means the article has been flagged as a rumor)
       try {
         var http = require('http');
         http.get('http://xxx/getwxpost', function(res) {// another endpoint on our own server that returns the next URL wrapped in a JS snippet, so the page redirects automatically; its logic (getwxpost.php in the original PHP version) is described later
           res.on('data', function(chunk){
             callback(chunk+serverResData);
           })
         });
       }catch(e){
         callback(serverResData);
       }
     }else{
       callback(serverResData);
     }
     //callback(serverResData);
   },

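A brief explanation: the history page of a WeChat official account comes in two URL forms, one starting with mp.weixin.qq.com/mp/getmasssendmsg and the other starting with mp.weixin.qq.com/mp/profile_ext. A history page can be scrolled down; scrolling triggers a JS event that sends a request and receives JSON (the next page of content). There are also article links, and links that return an article's read count and like count (as JSON). The forms of these URLs are fixed, so they can be told apart with simple conditional logic. One open question is how to crawl a complete history. My idea is to use JS to simulate scrolling down, which triggers the request that loads the next part of the list; alternatively, analyze the scroll-load request with AnyProxy and send it to the WeChat server directly. Either way, the problem remains of how to tell that no data is left. I only crawl the latest data, so I do not need this for now, though I may later. If you need it, give it a try.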
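
The URL distinctions above can also be written down on the Java side, for example if the server wants to double-check which kind of page a forwarded URL came from. The following is only a sketch: the class name WxUrlClassifier and the enum constants are made up for illustration, and the patterns are copied from the proxy rule shown earlier, checked in the same order.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: classifies intercepted WeChat URLs the same way the proxy rule does. */
class WxUrlClassifier {

    enum Kind { HISTORY_V1, HISTORY_V2_HOME, HISTORY_V2_MORE, ARTICLE_STATS, ARTICLE, OTHER }

    // Same patterns as the JavaScript rule, all case-insensitive.
    private static final Pattern HISTORY_V1 =
            Pattern.compile("mp/getmasssendmsg", Pattern.CASE_INSENSITIVE);
    private static final Pattern HISTORY_V2_HOME =
            Pattern.compile("mp/profile_ext\\?action=home", Pattern.CASE_INSENSITIVE);
    private static final Pattern HISTORY_V2_MORE =
            Pattern.compile("mp/profile_ext\\?action=getmsg", Pattern.CASE_INSENSITIVE);
    private static final Pattern ARTICLE_STATS =
            Pattern.compile("mp/getappmsgext", Pattern.CASE_INSENSITIVE);
    private static final Pattern ARTICLE =
            Pattern.compile("s\\?__biz|mp/rumor", Pattern.CASE_INSENSITIVE);

    /** Returns the first matching kind, testing in the same order as the proxy rule. */
    static Kind classify(String url) {
        if (HISTORY_V1.matcher(url).find()) return Kind.HISTORY_V1;
        if (HISTORY_V2_HOME.matcher(url).find()) return Kind.HISTORY_V2_HOME;
        if (HISTORY_V2_MORE.matcher(url).find()) return Kind.HISTORY_V2_MORE;
        if (ARTICLE_STATS.matcher(url).find()) return Kind.ARTICLE_STATS;
        if (ARTICLE.matcher(url).find()) return Kind.ARTICLE;
        return Kind.OTHER;
    }
}
```

Calling WxUrlClassifier.classify(url) on a forwarded URL then gives one value to switch on instead of repeating the regex tests in several handlers.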

Below is the httpPost method referenced above.

 function httpPost(str,url,path) {// sends JSON to the server: str is the JSON content, url is the address of the intercepted page, path is the path of the receiving program
     console.log("start forwarding");
     try{
     var http = require('http');
     var data = {
         str: encodeURIComponent(str),
         url: encodeURIComponent(url)
     };
     data = require('querystring').stringify(data);
     var options = {
         method: "POST",
         host: "xxx",// the server's domain name; note there is no http:// prefix
         port: xxx,
         path: path,// path of the receiving program
         headers: {
             'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
             "Content-Length": data.length
         }
     };
     var req = http.request(options, function (res) {
         res.setEncoding('utf8');
         res.on('data', function (chunk) {
             console.log('body: ' + chunk);
         });
     });
     req.on('error', function (e) {
         console.log('problem with request: ' + e.message);
     });
     
     req.write(data);
     req.end();
     }catch(e){
         console.log("error: "+e);
     }
     console.log("forwarding finished");
 }

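With the above in place, the next step is to write the server-side code according to your own business needs. Our service receives the data sent by the proxy server, processes and persists it, and also sends back to the proxy server the JS code to be injected into WeChat. The proxy intercepts several different kinds of links, so we need a corresponding method for the data from each of them. From the AnyProxy method replaceServerResDataAsync: function(req,res,serverResData,callback) we can see that at least three methods are needed: one for official-account history page data, one for article page data, and one for article like-count and read-count data. We also need a method that generates crawl tasks, so that the official accounts are crawled in rotation. If you need more data, analyze the links captured by AnyProxy to find it, add a condition to replaceServerResDataAsync: function(req,res,serverResData,callback) that intercepts the data and sends it to your own server, and add a matching server-side method to process it.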
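
One detail is easy to miss when receiving these requests. The proxy's httpPost wraps str and url in encodeURIComponent and then runs the result through querystring.stringify, which percent-encodes it a second time. Assuming the web framework decodes form fields once, as servlet containers normally do, the values that reach the handler are still percent-encoded once and need one more decoding pass. A minimal sketch (ForwardedField is a hypothetical name, not part of the original code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper: undoes the extra encodeURIComponent layer added by the proxy's httpPost. */
class ForwardedField {

    /**
     * @param fieldAfterFormParsing the "str" or "url" form field as delivered by the
     *                              web framework, i.e. after its own single decoding pass
     * @return the original JSON text or URL
     */
    static String decode(String fieldAfterFormParsing) {
        try {
            return URLDecoder.decode(fieldAfterFormParsing, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }
}
```

URLDecoder treats a literal + as a space, which is harmless here because encodeURIComponent never emits a bare + (it escapes it as %2B).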

I wrote the server-side code in Java.

Method that handles official-account history page data:

// Needs: java.util.List, java.util.Map, java.net.URLEncoder,
// java.io.UnsupportedEncodingException, com.jayway.jsonpath.JsonPath
public void getMsgJson(String str, String url) throws UnsupportedEncodingException {
    String biz = "";
    Map<String,String> queryStrs = HttpUrlParser.parseUrl(url);
    if(queryStrs != null){
      biz = queryStrs.get("__biz");
      biz = biz + "==";
    }
    /**
     * Look up biz in the database and insert it if it does not exist yet;
     * that means a new target official account has been added.
     */
    List<Weixin> results = weixinMapper.selectByBiz(biz);
    if(results == null || results.size() == 0){
      Weixin weixin = new Weixin();
      weixin.setBiz(biz);
      weixin.setCollect(System.currentTimeMillis());
      weixinMapper.insert(weixin);
    }
    //System.out.println(str);
    // parse the str variable
    List<Object> lists = JsonPath.read(str, "['list']");
    for(Object list : lists){
      Object json = list;
      int type = JsonPath.read(json, "['comm_msg_info']['type']");
      if(type == 49){// type=49 means an article (image-and-text) message
        String content_url = JsonPath.read(json, "$.app_msg_ext_info.content_url");
        content_url = content_url.replace("\\", "").replaceAll("amp;", "");// link of the article message
        int is_multi = JsonPath.read(json, "$.app_msg_ext_info.is_multi");// whether this is a multi-article message
        Integer datetime = JsonPath.read(json, "$.comm_msg_info.datetime");// time the message was sent
        /**
         * Insert the article link into the crawl queue table tmplist
         * (the queue is described later; its purpose is to build a batch crawl queue
         * from which another program picks the next official account or article to crawl).
         */
        try{
          if(content_url != null && !"".equals(content_url)){
            TmpList tmpList = new TmpList();
            tmpList.setContentUrl(content_url);
            tmpListMapper.insertSelective(tmpList);
          }
        }catch(Exception e){
          System.out.println("Already in the queue, not inserted!");
        }
        
        /**
         * Check by content_url whether the article already exists in the post table.
         */
        List<Post> postList = postMapper.selectByContentUrl(content_url);
        boolean contentUrlExist = false;
        if(postList != null && postList.size() != 0){
          contentUrlExist = true;
        }
      
        
        if(!contentUrlExist){// no row with the same content_url exists in the post table
          Integer fileid = JsonPath.read(json, "$.app_msg_ext_info.fileid");// an id assigned by WeChat
          String title = JsonPath.read(json, "$.app_msg_ext_info.title");// article title
          String title_encode = URLEncoder.encode(title, "utf-8");
          String digest = JsonPath.read(json, "$.app_msg_ext_info.digest");// article digest
          String source_url = JsonPath.read(json, "$.app_msg_ext_info.source_url");// "read the original" link
          source_url = source_url.replace("\\", "");
          String cover = JsonPath.read(json, "$.app_msg_ext_info.cover");// cover image
          cover = cover.replace("\\", "");
          /**
           * Save to the database.
           */
//          System.out.println("headline title: "+title);
//          System.out.println("WeChat id: "+fileid);
//          System.out.println("digest: "+digest);
//          System.out.println("original link: "+source_url);
//          System.out.println("cover image URL: "+cover);
          
          Post post = new Post();
          post.setBiz(biz);
          post.setTitle(title);
          post.setTitleEncode(title_encode);
          post.setFieldId(fileid);
          post.setDigest(digest);
          post.setSourceUrl(source_url);
          post.setCover(cover);
          post.setIsTop(1);// mark as the headline article
          post.setIsMulti(is_multi);
          post.setDatetime(datetime);
          post.setContentUrl(content_url);
          
          postMapper.insert(post);
        }
      
        if(is_multi == 1){// multi-article message
          List<Object> multiLists = JsonPath.read(json, "['app_msg_ext_info']['multi_app_msg_item_list']");
          for(Object multiList : multiLists){
            Object multiJson = multiList;
            content_url = JsonPath.read(multiJson, "['content_url']").toString().replace("\\", "").replaceAll("amp;", "");// link of the article message
            /**
             * Check again by content_url whether the article already exists, to avoid errors.
             */
            contentUrlExist = false;
            List<Post> posts = postMapper.selectByContentUrl(content_url);
            if(posts != null && posts.size() != 0){
              contentUrlExist = true;
            }
            if(!contentUrlExist){// no row with the same content_url exists in the database
              /**
               * Insert the article link into the crawl queue table
               * (the queue is described later; its purpose is to build a batch crawl queue
               * from which another program picks the next official account or article to crawl).
               */
              if(content_url != null && !"".equals(content_url)){
                TmpList tmpListT = new TmpList();
                tmpListT.setContentUrl(content_url);
                tmpListMapper.insertSelective(tmpListT);
              }
              
              String title = JsonPath.read(multiJson, "$.title");
              String title_encode = URLEncoder.encode(title, "utf-8");
              Integer fileid = JsonPath.read(multiJson, "$.fileid");
              String digest = JsonPath.read(multiJson, "$.digest");
              String source_url = JsonPath.read(multiJson, "$.source_url");
              source_url = source_url.replace("\\", "");
              String cover = JsonPath.read(multiJson, "$.cover");
              cover = cover.replace("\\", "");
//              System.out.println("title: "+title);
//              System.out.println("WeChat id: "+fileid);
//              System.out.println("digest: "+digest);
//              System.out.println("original link: "+source_url);
//              System.out.println("cover image URL: "+cover);
              Post post = new Post();
              post.setBiz(biz);
              post.setTitle(title);
              post.setTitleEncode(title_encode);
              post.setFieldId(fileid);
              post.setDigest(digest);
              post.setSourceUrl(source_url);
              post.setCover(cover);
              post.setIsTop(0);// mark as not the headline article
              post.setIsMulti(is_multi);
              post.setDatetime(datetime);
              post.setContentUrl(content_url);
              
              postMapper.insert(post);
              
            }
          }
        }
      }
    }
  }
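
The URL-parsing helper used at the top of this method is not shown in the post. As a rough idea of what such a helper might do, here is a sketch that splits a URL's query string into a map. The class name QueryStringSketch is made up, and its behavior is an assumption: the fact that the method above appends "==" to biz suggests the author's own parser drops the trailing padding, whereas this sketch keeps the value intact, so with it the "==" would not need to be appended.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the URL-parsing helper, which the post does not show. */
class QueryStringSketch {

    /** Splits the query string of a URL into name/value pairs, ignoring any #fragment. */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        String query = url.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop "#wechat_redirect" and similar
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // Split at the FIRST '=' only, so base64 padding such as "==" stays in the value.
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```

Values are returned as they appear in the URL, without percent-decoding; whether that is needed depends on how the URL was forwarded.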

Method that handles official-account article pages:

// Needs: java.util.List, java.util.Random
public String getWxPost() {
    /**
     * Called when the current page is an official-account article page.
     * First delete the rows in the crawl queue table whose load = 1,
     * then select several rows from the queue table "order by id asc"
     * (note that this differs from getWxHis, which picks a single row).
     */
    tmpListMapper.deleteByLoad(1);
    List<TmpList> queues = tmpListMapper.selectMany(5);
    String url = "";
    if(queues != null && queues.size() > 1){
      TmpList queue = queues.get(0);
      url = queue.getContentUrl();
      queue.setIsLoad(1);
      int result = tmpListMapper.updateByPrimaryKey(queue);
      System.out.println("update result:"+result);
    }else{
      // parentheses are required here: without them the string concatenation binds first and a null queues causes a NullPointerException
      System.out.println("getpost queues is null? " + (queues == null ? null : queues.size()));
      Weixin weixin = weixinMapper.selectOne();
      String biz = weixin.getBiz();
      // Build the history page URL (second page form). The first page form,
      // "http://mp.weixin.qq.com/mp/getmasssendmsg?__biz=" + biz + "#wechat_webview_type=1&wechat_redirect",
      // was chosen at random in an earlier version, but the result was always overwritten by this URL.
      url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=" + biz + 
          "#wechat_redirect";
      // update the collect-time field of the official-account table to the current timestamp
      weixin.setCollect(System.currentTimeMillis());
      int result = weixinMapper.updateByPrimaryKey(weixin);
      System.out.println("getpost weixin updateresult:"+result);
    }
    int randomTime = new Random().nextInt(3) + 3;// random delay of 3 to 5 seconds
    String jsCode = "<script>setTimeout(function(){window.location.href='"+url+"';},"+randomTime*1000+");</script>";
    return jsCode;
    
  }

Method that handles an article's like count and read count:

// Needs: java.util.Map, com.jayway.jsonpath.JsonPath
public void getMsgExt(String str, String url) {
    String biz = "";
    String sn = "";
    Map<String,String> queryStrs = HttpUrlParser.parseUrl(url);
    if(queryStrs != null){
      biz = queryStrs.get("__biz");
      biz = biz + "==";
      sn = queryStrs.get("sn");
      sn = "%" + sn + "%";// wrapped in wildcards for the LIKE queries below
    }
    /**
     * $sql = "select * from `post table` where `biz`='".$biz."'
     * and `content_url` like '%".$sn."%'" limit 0,1;
     * Find the matching article by biz and sn.
     */
    Post post = postMapper.selectByBizAndSn(biz, sn);
    
    if(post == null){
      System.out.println("biz:"+biz);
      System.out.println("sn:"+sn);
      tmpListMapper.deleteByLoad(1);
      return;
    }
    
//    System.out.println("json data: "+str);
    Integer read_num;
    Integer like_num;
    try{
      read_num = JsonPath.read(str, "['appmsgstat']['read_num']");// read count
      like_num = JsonPath.read(str, "['appmsgstat']['like_num']");// like count
    }catch(Exception e){
      // placeholder values that are stored when the JSON cannot be parsed
      read_num = 123;
      like_num = 321;
      System.out.println("read_num:"+read_num);
      System.out.println("like_num:"+like_num);
      System.out.println(e.getMessage());
    }
    
    /**
     * Delete the matching article from the crawl queue table by sn as well,
     * which means this article can leave the crawl queue.
     * $sql = "delete from `queue table` where `content_url` like '%".$sn."%'"
     */
    tmpListMapper.deleteBySn(sn);
    
    // then write the read count and like count to the post table
    post.setReadNum(read_num);
    post.setLikeNum(like_num);
    postMapper.updateByPrimaryKey(post);
    
  }

Method that produces the redirect JS injected into WeChat:

// Needs: java.util.Random
public String getWxHis() {
    String url = "";
    /**
     * Called when the current page is an official-account history page.
     * The crawl queue table has a load field; the value 1 means the row is being read.
     * First delete the rows in the crawl queue table whose load = 1,
     * then select an arbitrary row from the queue table.
     */
    tmpListMapper.deleteByLoad(1);
    TmpList queue = tmpListMapper.selectRandomOne();
    System.out.println("queue is null?"+queue);
    if(queue == null){// the queue table is empty
      /**
       * If the queue table is empty, take a biz from the table that stores the official accounts.
       * That table has a collect-time field; sorting by it in ascending order
       * yields the account with the smallest timestamp, whose biz is used.
       */
      Weixin weixin = weixinMapper.selectOne();
      
      String biz = weixin.getBiz();
      url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=" + biz + 
          "#wechat_redirect";// build the history page URL (second page form)
      // update the collect-time field of the official-account table to the current timestamp
      weixin.setCollect(System.currentTimeMillis());
      int result = weixinMapper.updateByPrimaryKey(weixin);
      System.out.println("gethis weixin updateresult:"+result);
    }else{
      // take the content_url field of this row
      url = queue.getContentUrl();
      // update the load field to 1
      tmpListMapper.updateByContentUrl(url);
    }
    // turn the next URL into a JS snippet, which AnyProxy injects into the WeChat page
    // PHP original: echo "<script>setTimeout(function(){window.location.href='".$url."';},2000);</script>";
    int randomTime = new Random().nextInt(3) + 3;// random delay of 3 to 5 seconds
    String jsCode = "<script>setTimeout(function(){window.location.href='"+url+"';},"+randomTime*1000+");</script>";
    return jsCode;
  }
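
Both getWxPost and getWxHis build the injected snippet by plain string concatenation, which works as long as the URL contains nothing that could terminate the JS string literal. As a possible refinement, the snippet construction could be pulled into one helper that escapes the URL first. This is a sketch rather than part of the original program; the class name RedirectSnippet and its methods are hypothetical.

```java
import java.util.Random;

/** Hypothetical helper that builds the redirect snippet returned to the proxy. */
class RedirectSnippet {

    /** Builds a script tag that navigates to url after delaySeconds seconds. */
    static String build(String url, int delaySeconds) {
        // Escape characters that would break out of the single-quoted JS string
        // or close the surrounding script element early.
        String safeUrl = url.replace("\\", "\\\\")
                            .replace("'", "\\'")
                            .replace("</", "<\\/");
        return "<script>setTimeout(function(){window.location.href='" + safeUrl
                + "';}," + (delaySeconds * 1000) + ");</script>";
    }

    /** Same delay range as the methods above: 3, 4 or 5 seconds. */
    static int randomDelaySeconds(Random random) {
        return random.nextInt(3) + 3;
    }
}
```

With it, each method could end with return RedirectSnippet.build(url, RedirectSnippet.randomDelaySeconds(new Random()));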

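That is the code that processes the data intercepted by the proxy server. One thing to be aware of: the program visits every official account stored in the database in rotation, and it even revisits articles that are already stored, in order to keep their read counts and like counts up to date. If you need to crawl a large number of official accounts, I recommend modifying the code that adds tasks to the queue and adding conditions to it; otherwise, once there are many accounts, repeatedly crawling the same data in rotation will hurt efficiency badly.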
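
One possible condition of this kind is an age limit: only put an article into the queue while it is recent enough for its counts to still be changing. The sketch below assumes that the datetime value read from comm_msg_info is a Unix timestamp in seconds; the class name RecrawlPolicy and the seven-day figure in the usage note are illustrative choices, not something the original program prescribes.

```java
/** Hypothetical policy: decides whether an article is still worth queueing for a crawl. */
class RecrawlPolicy {

    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /**
     * @param publishedEpochSeconds send time of the message (assumed to be Unix seconds)
     * @param nowMillis             current time in milliseconds, e.g. System.currentTimeMillis()
     * @param maxAgeDays            articles older than this many days are skipped
     */
    static boolean shouldEnqueue(long publishedEpochSeconds, long nowMillis, int maxAgeDays) {
        long ageMillis = nowMillis - publishedEpochSeconds * 1000L;
        return ageMillis <= maxAgeDays * MILLIS_PER_DAY;
    }
}
```

In getMsgJson, the queue insert could then be guarded with something like if (RecrawlPolicy.shouldEnqueue(datetime, System.currentTimeMillis(), 7)).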

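At this point all the article links of the official accounts have been collected. These links are permanently valid and can be opened in an ordinary browser. The next step is to write a crawler that takes the links from the database and fetches the article content and other information.

I wrote the crawler with WebMagic, which is lightweight and easy to use.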

import java.util.List;

import javax.management.JMException;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.monitor.SpiderMonitor;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

public class SpiderModel implements PageProcessor{
  
  private static PostMapper postMapper;
  
  private static List<Post> posts;
  
  // crawl settings for the site: encoding, crawl interval, retry count and so on
  private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
  
  public Site getSite() {
    return this.site;
  }
  
  public void process(Page page) {
    Post post = posts.remove(0);
    String content = page.getHtml().xpath("//div[@id='js_content']").get();
    // some articles have been taken down; detect that here and either delete the record or set a flag marking the article as removed
    if(content == null){
      System.out.println("The article has been taken down!");
      //postMapper.deleteByPrimaryKey(post.getId());
      return;
    }
    String contentSnap = content.replaceAll("data-src", "src").replaceAll("preview.html", "player.html");// snapshot
    String contentTxt = HtmlToWord.stripHtml(content);// plain-text content
    
    Selectable metaContent = page.getHtml().xpath("//div[@id='meta_content']");
    String pubTime = null;
    String wxName = null;
    String author = null;
    if(metaContent != null){
      pubTime = metaContent.xpath("//em[@id='post-date']").get();
      if(pubTime != null){
        pubTime = HtmlToWord.stripHtml(pubTime);// publication time
      }
      wxName = metaContent.xpath("//a[@id='post-user']").get();
      if(wxName != null){
        wxName = HtmlToWord.stripHtml(wxName);// official-account name
      }
      author = metaContent.xpath("//em[@class='rich_media_meta rich_media_meta_text' and @id!='post-date']").get();
      if(author != null){
        author = HtmlToWord.stripHtml(author);// article author
      }
    }
    
//    System.out.println("publication time: "+pubTime);
//    System.out.println("official-account name: "+wxName);
//    System.out.println("author: "+author);
    
    String title = post.getTitle().replaceAll(" ", "");// article title
    String digest = post.getDigest();// article digest
    int likeNum = post.getLikeNum();// like count
    int readNum = post.getReadNum();// read count
    String contentUrl = post.getContentUrl();// article link
    
    WechatInfoBean wechatBean = new WechatInfoBean();
    wechatBean.setTitle(title);
    wechatBean.setContent(contentTxt);// plain-text content
    wechatBean.setSourceCode(contentSnap);// snapshot
    wechatBean.setLikeCount(likeNum);
    wechatBean.setViewCount(readNum);
    wechatBean.setAbstractText(digest);// digest
    wechatBean.setUrl(contentUrl);
    wechatBean.setPublishTime(pubTime);
    wechatBean.setSiteName(wxName);// site name, i.e. the official-account name
    wechatBean.setAuthor(author);
    wechatBean.setMediaType("微信公众号");// source media type
    
    WechatStorage.saveWechatInfo(wechatBean);
    
    // mark the article as crawled
    post.setIsSpider(1);
    postMapper.updateByPrimaryKey(post);
    
  }
  
  public static void startSpider(List<Post> inPosts, PostMapper myPostMapper, String... urls){
    
    long startTime, endTime;
    startTime = System.currentTimeMillis();
    postMapper = myPostMapper;
    posts = inPosts;
    
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
    SpiderModel spiderModel = new SpiderModel();
    Spider mySpider = Spider.create(spiderModel).addUrl(urls);
    mySpider.setDownloader(httpClientDownloader);
    try {
      SpiderMonitor.instance().register(mySpider);
      mySpider.thread(1).run();
    } catch (JMException e) {
      e.printStackTrace();
    }
    
    endTime = System.currentTimeMillis();
    System.out.println("Crawl time: " + ((endTime - startTime) / 1000) + " s");
    
  }
  
}
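
The tag-stripping helper that the crawler calls to get plain text is not included in the post. To give an idea of what such a helper might look like, here is a crude regex-based sketch. The class name HtmlTextSketch is made up, and this is an assumption about the behavior rather than the author's implementation; for real pages an HTML parser such as jsoup would be more robust.

```java
/** Hypothetical stand-in for the tag-stripping helper, which the post does not show. */
class HtmlTextSketch {

    /** Removes script and style blocks and all remaining tags, then collapses whitespace. */
    static String stripHtml(String html) {
        if (html == null) {
            return "";
        }
        String text = html
                .replaceAll("(?is)<script.*?</script>", "") // drop inline scripts with their content
                .replaceAll("(?is)<style.*?</style>", "")   // drop inline styles with their content
                .replaceAll("(?s)<[^>]+>", "")              // drop every remaining tag
                .replace("&nbsp;", " ");                    // the most common entity in article bodies
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Other HTML entities are left as they are; decoding them is outside the scope of this sketch.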

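I will not post the remaining storage code, since it has nothing to do with the logic. I store the data captured by the proxy server in MySQL and the data fetched by my own crawler in MongoDB.

The original post ended with a screenshot of the official-account data I crawled.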
