A First Look at Web Crawlers in Java

程序员文章站 (Programmer Article Site), 2022-03-02 23:43:56

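With the end of the year approaching, most of my work is done and all that is left is fixing whatever testing turns up, so I spent a free morning learning about web crawlers. It paid off: regular expressions, which I had struggled with for a long time, finally make sense. Java's regex differs from JavaScript's in a few ways. What JS writes as "\w" has to be written as "\\w" in Java, because Java treats "\" inside a string literal as an escape character. And where JS uses the "\S\s" idiom (as in "[\S\s]*", meaning any number of any characters), Java can use ".*". Keep in mind that "." does not match line terminators unless Pattern.DOTALL is set; "[\\S\\s]*" also works in Java.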
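To make the escaping difference concrete, here is a minimal sketch using only java.util.regex. The class name RegexEscapeSketch, its helper method, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: class, method, and sample strings are not from the original post.
class RegexEscapeSketch {

    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // JS /\w+/ becomes "\\w+" in Java source: the compiler turns "\\" into one backslash.
        System.out.println(matches("\\w+", 0, "joey44"));                 // true

        String html = "<p>line one\nline two</p>";
        // "." skips line terminators by default, so ".*" cannot cross the "\n".
        System.out.println(matches("<p>.*</p>", 0, html));                // false
        // [\s\S] matches any character at all, same as the JS idiom.
        System.out.println(matches("<p>[\\s\\S]*</p>", 0, html));         // true
        // Pattern.DOTALL lets "." match line terminators too.
        System.out.println(matches("<p>.*</p>", Pattern.DOTALL, html));   // true
    }
}
```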

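I'm new to crawlers myself and drew on many blog posts and Q&A threads. I'll start with an overview so you have a rough picture, and then we'll get to the code.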

How a web crawler works:

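The basic workflow of a web crawler is as follows:

1. Select a set of carefully chosen seed URLs.

2. Put these URLs into the to-crawl URL queue.

3. Take a URL from the to-crawl queue, resolve its DNS to get the host's IP address, download the page for that URL, and store it in the downloaded-pages store. Then add the URL to the crawled URL queue.

4. Analyze the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the to-crawl queue, which starts the next cycle.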

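Crawl strategies include depth-first traversal, breadth-first traversal (reminded of depth-first and breadth-first traversal of graphs?), Partial PageRank, OPIC, and large-sites-first, among others.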

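I went with breadth-first traversal, which is the simpler one to implement: after collecting all the URLs on a page, push them onto the URL queue, so the loop condition is simply that the queue is not empty.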
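The queue-driven loop can be sketched without any networking. In the sketch below, an in-memory map stands in for "download the page and extract its links"; the class name BfsCrawlSketch, the crawl method, and the sample paths are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative only: linkGraph replaces real HTTP fetching and HTML parsing.
class BfsCrawlSketch {

    // Visits pages breadth-first from the seed and returns them in visit order.
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph, int maxPages) {
        Queue<String> unvisited = new ArrayDeque<>(); // to-crawl queue (FIFO)
        Set<String> seen = new HashSet<>();           // everything ever enqueued
        List<String> order = new ArrayList<>();

        unvisited.add(seed);
        seen.add(seed);

        // Loop condition: the queue is not empty (plus a page limit).
        while (!unvisited.isEmpty() && order.size() < maxPages) {
            String url = unvisited.remove();
            order.add(url);
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                 // add() is false for duplicates
                    unvisited.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "/a", List.of("/b", "/c"),
                "/b", List.of("/d", "/a"),
                "/c", List.of("/d"));
        System.out.println(crawl("/a", graph, 1000)); // [/a, /b, /c, /d]
    }
}
```

Swapping the FIFO queue for a stack would turn the same loop into a depth-first traversal.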

Web crawler (Hello World version)

Entry class:

package com.example.spiderman.page;

import com.example.spiderman.link.LinkFilter;
import com.example.spiderman.link.Links;
import com.example.spiderman.utils.FileTool;
import com.example.spiderman.utils.RegexRule;
import org.jsoup.select.Elements;

import java.util.Set;

/**
 * @author yhw
 * @ClassName: MyCrawler
 * @Description:
 */
public class MyCrawler {

    /**
     * Initialize the URL queue with the seeds.
     *
     * @param seeds seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (String seed : seeds) {
            Links.addUnvisitedUrlQueue(seed);
        }
    }


    /**
     * The crawl loop.
     *
     * @param seeds seed URLs
     */
    public void crawling(String[] seeds) {

        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        // Define a filter that accepts links starting with the given URL
        // (kept for extensibility; the loop below filters with RegexRule instead)
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.startsWith("https://www.cnblogs.com/joey44/");
            }
        };

        // Only links matching this rule are queued for crawling
        RegexRule regexRule = new RegexRule();
        regexRule.addPositive("http.*/joey44/.*html.*");

        // Loop while there are links left to crawl and no more than 1000 pages have been crawled
        while (!Links.unVisitedUrlQueueIsEmpty() && Links.getVisitedUrlNum() <= 1000) {

            // Take the first URL from the to-visit queue
            String visitUrl = (String) Links.removeHeadOfUnVisitedUrlQueue();
            if (visitUrl == null) {
                continue;
            }

            // Fetch the page for this URL
            Page page = RequestAndResponseTool.sendRequstAndGetResponse(visitUrl);
            if (page == null) {
                // The request failed: mark the URL as visited so it is not retried, then move on
                Links.addVisitedUrlSet(visitUrl);
                continue;
            }

            // Process the page: select a tag from the DOM
            Elements es = PageParserTool.select(page, "a");
            if (!es.isEmpty()) {
                System.out.println("Printing all <a> tags: ");
                System.out.println(es);
            }

            // Save the page to a file
            FileTool.saveToLocal(page);

            // Record the URL as visited
            Links.addVisitedUrlSet(visitUrl);

            // Extract the hyperlinks
            Set<String> links = PageParserTool.getLinks(page, "a");
            for (String link : links) {
                if (regexRule.satisfy(link)) {
                    Links.addUnvisitedUrlQueue(link);
                    System.out.println("New URL queued for crawling: " + link);
                }
            }
        }
    }


    // Entry point
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[]{"https://www.cnblogs.com/joey44/"});
    }
}
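The post never shows the Links class that the entry class calls. The following is a guess at a minimal version: the method names follow the calls made in the crawler (their capitalization is assumed), while the fields and the de-duplication logic are my own assumptions, not the author's code.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

// Hypothetical stand-in for com.example.spiderman.link.Links (not shown in the post).
class Links {

    // URLs already crawled; a set keeps the "seen before?" check cheap.
    private static final Set<String> visitedUrlSet = new HashSet<>();
    // URLs waiting to be crawled; FIFO order gives breadth-first traversal.
    private static final LinkedList<String> unVisitedUrlQueue = new LinkedList<>();

    static void addVisitedUrlSet(String url) {
        visitedUrlSet.add(url);
    }

    static int getVisitedUrlNum() {
        return visitedUrlSet.size();
    }

    // Enqueue only non-blank URLs that are neither visited nor already queued.
    static void addUnvisitedUrlQueue(String url) {
        if (url != null && !url.trim().isEmpty()
                && !visitedUrlSet.contains(url)
                && !unVisitedUrlQueue.contains(url)) {
            unVisitedUrlQueue.add(url);
        }
    }

    // Returns the next URL to crawl, or null when the queue is empty.
    static String removeHeadOfUnVisitedUrlQueue() {
        return unVisitedUrlQueue.poll();
    }

    static boolean unVisitedUrlQueueIsEmpty() {
        return unVisitedUrlQueue.isEmpty();
    }
}
```

Static state keeps the sketch close to how the crawler calls it, but it also means only one crawl can run per JVM; an instance-based frontier would be the natural next step.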

Network programming naturally involves HTTP access; Apache's HttpClient package handles that directly:

package com.example.spiderman.page;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.IOException;

/**
 * @author yhw
 * @ClassName: RequestAndResponseTool
 * @Description:
 */
public class RequestAndResponseTool {


    public static Page sendRequstAndGetResponse(String url) {
        Page page = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5s
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // 4. Handle the HTTP response
            byte[] responseBody = getMethod.getResponseBody(); // read the body as a byte array
            // The Content-Type header may be missing, so guard against null
            Header contentTypeHeader = getMethod.getResponseHeader("Content-Type");
            String contentType = contentTypeHeader == null ? "" : contentTypeHeader.getValue();
            page = new Page(responseBody, url, contentType); // wrap it as a Page
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return page;
    }
}

Don't forget to add the Maven dependency:

<!-- https://mvnrepository.com/artifact/commons-httpclient/commons-httpclient -->
        <dependency>
            <groupId>commons-httpclient</groupId>
            <artifactId>commons-httpclient</artifactId>
            <version>3.0</version>
        </dependency>

Next, the program needs to be able to store a page, so create a Page entity class:

package com.example.spiderman.page;

import com.example.spiderman.utils.CharsetDetector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.UnsupportedEncodingException;

/**
 * @author yhw
 * @ClassName: Page
 * @Description: holds the content of a fetched response
 */
public class Page {

    private byte[] content;
    private String html;        // page source as a string
    private Document doc;       // page DOM document
    private String charset;     // character encoding
    private String url;         // URL of the page
    private String contentType; // content type


    public Page(byte[] content, String url, String contentType) {
        this.content = content;
        this.url = url;
        this.contentType = contentType;
    }

    public String getCharset() {
        return charset;
    }
    public String getUrl() { return url; }
    public String getContentType() { return contentType; }
    public byte[] getContent() { return content; }

    /**
     * Returns the page source as a string.
     *
     * @return the page source, or null if there is no content or the charset is unsupported
     */
    public String getHtml() {
        if (html != null) {
            return html;
        }
        if (content == null) {
            return null;
        }
        if (charset == null) {
            charset = CharsetDetector.guessEncoding(content); // guess the encoding from the content
        }
        try {
            this.html = new String(content, charset);
            return html;
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    /*
     * Returns the parsed document.
     */
    public Document getDoc() {
        if (doc != null) {
            return doc;
        }
        try {
            this.doc = Jsoup.parse(getHtml(), url);
            return doc;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }


}

Then we need a class that parses pages:

package com.example.spiderman.page;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;

/**
 * @author yhw
 * @ClassName: PageParserTool
 * @Description:
 */
public class PageParserTool {


    /* Select elements of the page with a CSS selector */
    public static Elements select(Page page, String cssSelector) {
        return page.getDoc().select(cssSelector);
    }

    /*
     * Get one specific element by CSS selector and index;
     * a negative index counts from the end.
     */
    public static Element select(Page page, String cssSelector, int index) {
        Elements eles = select(page, cssSelector);
        int realIndex = index;
        if (index < 0) {
            realIndex = eles.size() + index;
        }
        return eles.get(realIndex);
    }


    /**
     * Gets the links of the elements matched by the selector. The cssSelector must
     * point at the hyperlinks themselves: to extract every hyperlink inside the div
     * with id "content", use "div[id=content] a".
     * The links are collected in a Set to avoid duplicates.
     *
     * @param cssSelector CSS selector that locates the link elements
     * @return the absolute URLs found
     */
    public static Set<String> getLinks(Page page, String cssSelector) {
        Set<String> links = new HashSet<String>();
        Elements es = select(page, cssSelector);
        for (Element element : es) {
            if (element.hasAttr("href")) {
                links.add(element.attr("abs:href"));
            } else if (element.hasAttr("src")) {
                links.add(element.attr("abs:src"));
            }
        }
        return links;
    }



    /**
     * Gets the given attribute of every element that matches the CSS selector.
     * For example, getAttrs(page, "img[src]", "abs:src") returns the links of all images in the page.
     *
     * @param cssSelector CSS selector
     * @param attrName    attribute name
     * @return the attribute values
     */
    public static ArrayList<String> getAttrs(Page page, String cssSelector, String attrName) {
        ArrayList<String> result = new ArrayList<String>();
        Elements eles = select(page, cssSelector);
        for (Element ele : eles) {
            if (ele.hasAttr(attrName)) {
                result.add(ele.attr(attrName));
            }
        }
        return result;
    }
}
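The "abs:" prefix used in the link extraction asks jsoup to resolve the attribute against the page's base URL, which is why the page URL is passed in when the document is parsed. The sketch below shows the same idea with the standard library's java.net.URI rather than jsoup itself; the class name AbsoluteLinkSketch and the sample paths are made up.

```java
import java.net.URI;

// Illustrative only: a standard-library analogue of resolving an href against a base URL.
class AbsoluteLinkSketch {

    // Resolves an href against the URL of the page it was found on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.cnblogs.com/joey44/";
        // Relative path: appended to the page's directory.
        System.out.println(toAbsolute(page, "p/example.html"));
        // Root-relative path: replaces the whole path.
        System.out.println(toAbsolute(page, "/joey44/archive.html"));
        // Already absolute: returned unchanged.
        System.out.println(toAbsolute(page, "https://example.com/x.html"));
    }
}
```

Note that URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, ones containing spaces), so real-world hrefs would need cleaning first.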

Don't forget to add the Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>

Finally, a few extra utility classes are needed: a regex matching tool, page encoding detection, and a tool that saves pages.

package com.example.spiderman.utils;

import java.util.ArrayList;
import java.util.regex.Pattern;

/**
 * @author yhw
 * @ClassName: RegexRule
 * @Description:
 */
public class RegexRule {

    public RegexRule() {

    }

    public RegexRule(String rule) {
        addRule(rule);
    }

    public RegexRule(ArrayList<String> rules) {
        for (String rule : rules) {
            addRule(rule);
        }
    }

    public boolean isEmpty() {
        return positive.isEmpty();
    }

    private ArrayList<String> positive = new ArrayList<String>();
    private ArrayList<String> negative = new ArrayList<String>();



    /**
     * Adds a regex rule. There are two kinds of rule: positive and negative.
     * A URL satisfies the rules when it 1. matches at least one positive regex
     * and 2. matches no negative regex.
     * Positive example: +a.*c is a positive rule whose regex is a.*c; the leading plus marks it as positive.
     * Negative example: -a.*c is a negative rule whose regex is a.*c; the leading minus marks it as negative.
     * A rule that starts with neither a plus nor a minus is a positive rule whose regex is the rule itself,
     * e.g. a.*c is a positive rule with the regex a.*c.
     *
     * @param rule the regex rule
     * @return this
     */
    public RegexRule addRule(String rule) {
        if (rule.length() == 0) {
            return this;
        }
        char pn = rule.charAt(0);
        String realRule = rule.substring(1);
        if (pn == '+') {
            addPositive(realRule);
        } else if (pn == '-') {
            addNegative(realRule);
        } else {
            addPositive(rule);
        }
        return this;
    }



    /**
     * Adds a positive regex.
     *
     * @param positiveRegex the regex to add
     * @return this
     */
    public RegexRule addPositive(String positiveRegex) {
        positive.add(positiveRegex);
        return this;
    }


    /**
     * Adds a negative regex.
     *
     * @param negativeRegex the regex to add
     * @return this
     */
    public RegexRule addNegative(String negativeRegex) {
        negative.add(negativeRegex);
        return this;
    }


    /**
     * Checks whether the input string satisfies the rules.
     *
     * @param str the input string
     * @return true if the string matches no negative regex and at least one positive regex
     */
    public boolean satisfy(String str) {
        // Any negative match rejects the string outright
        for (String nregex : negative) {
            if (Pattern.matches(nregex, str)) {
                return false;
            }
        }

        // Otherwise one positive match is enough
        for (String pregex : positive) {
            if (Pattern.matches(pregex, str)) {
                return true;
            }
        }
        return false;
    }
}
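One detail worth spelling out: Pattern.matches requires the regex to match the entire string, which is why the rule used by the crawler is written with ".*" on both sides of the interesting part. A small sketch of the difference follows; the class name WholeMatchSketch and the sample URL path are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: contrasts whole-string matching with substring search.
class WholeMatchSketch {

    // True only if the regex matches the entire string (what Pattern.matches does).
    static boolean whole(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    // True if the regex matches anywhere inside the string.
    static boolean partial(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String url = "https://www.cnblogs.com/joey44/p/example.html";

        System.out.println(whole("/joey44/.*html", url));          // false: prefix is not covered
        System.out.println(partial("/joey44/.*html", url));        // true: found as a substring
        System.out.println(whole("http.*/joey44/.*html.*", url));  // true: the rule spans the whole URL
        System.out.println(whole("http.*/joey44/.*html.*",
                "https://www.cnblogs.com/joey44/"));               // false: no "html" in the URL
    }
}
```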

package com.example.spiderman.utils;

import org.mozilla.universalchardet.UniversalDetector;

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author yhw
 * @ClassName: CharsetDetector
 * @Description: automatic charset detection
 */
public class CharsetDetector {

    // Page-encoding detection code borrowed from Nutch
    private static final int CHUNK_SIZE = 2000;

    private static Pattern metaPattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPattern = Pattern.compile(
            "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPatternHTML5 = Pattern.compile(
            "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Page-encoding detection code borrowed from Nutch
    private static String guessEncodingByNutch(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);

        String str = "";
        try {
            // Only the first CHUNK_SIZE bytes are inspected for <meta> tags
            str = new String(content, 0, length, "ascii");
        } catch (UnsupportedEncodingException e) {
            return null;
        }

        // First look for <meta http-equiv="content-type" ... charset=...>
        Matcher metaMatcher = metaPattern.matcher(str);
        String encoding = null;
        if (metaMatcher.find()) {
            Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
            if (charsetMatcher.find()) {
                encoding = charsetMatcher.group(1);
            }
        }
        // Then the HTML5 form <meta charset=...>
        if (encoding == null) {
            metaMatcher = charsetPatternHTML5.matcher(str);
            if (metaMatcher.find()) {
                encoding = metaMatcher.group(1);
            }
        }
        // Finally fall back to the byte order mark, if any
        if (encoding == null) {
            if (length >= 3 && content[0] == (byte) 0xEF
                    && content[1] == (byte) 0xBB && content[2] == (byte) 0xBF) {
                encoding = "UTF-8";
            } else if (length >= 2) {
                if (content[0] == (byte) 0xFF && content[1] == (byte) 0xFE) {
                    encoding = "UTF-16LE";
                } else if (content[0] == (byte) 0xFE
                        && content[1] == (byte) 0xFF) {
                    encoding = "UTF-16BE";
                }
            }
        }

        return encoding;
    }

    /**
     * Guesses the charset of a byte array; returns UTF-8 if detection fails.
     *
     * @param bytes the byte array to inspect
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncodingByMozilla(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }

    /**
     * Guesses the charset of a byte array; returns UTF-8 if detection fails.
     *
     * @param content the byte array to inspect
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncoding(byte[] content) {
        String encoding;
        try {
            encoding = guessEncodingByNutch(content);
        } catch (Exception ex) {
            return guessEncodingByMozilla(content);
        }

        if (encoding == null) {
            encoding = guessEncodingByMozilla(content);
        }
        return encoding;
    }
}
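The first stage of the detection, reading the charset out of a meta tag, can be tried in isolation. The sketch below reuses the HTML5 pattern from the class above with only the standard library; the class name MetaCharsetSketch, the sniff method, and the sample markup are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: just the <meta charset=...> step, without BOM or statistical detection.
class MetaCharsetSketch {

    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a <meta charset=...> tag, or null if none is found.
    static String sniff(byte[] content) {
        // Decoding as ASCII is enough here: tag names and charset labels are ASCII-only.
        int length = Math.min(content.length, 2000);
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] withMeta = "<html><head><meta charset=\"gbk\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] withoutMeta = "<html></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(withMeta));    // gbk
        System.out.println(sniff(withoutMeta)); // null
    }
}
```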

package com.example.spiderman.utils;

import com.example.spiderman.page.Page;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * @author yhw
 * @ClassName: FileTool
 * @Description: saves pages that have already been visited to local files
 */
public class FileTool {

    private static String dirPath;


    /**
     * Builds the file name for a page from its URL and content type
     * (the value of getMethod.getResponseHeader("Content-Type").getValue()),
     * replacing characters that are not allowed in file names.
     */
    private static String getFileNameByUrl(String url, String contentType) {
        // Strip the scheme (http:// or https://)
        url = url.replaceFirst("^https?://", "");
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        }
        // Other types, e.g. application/pdf
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "." +
                    contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /*
     * Creates the output directory if it does not exist.
     */
    private static void mkdir() {
        if (dirPath == null) {
            // "temp" under the classpath root, with the platform's path separator
            dirPath = FileTool.class.getResource("/").getPath() + "temp" + File.separator;
        }
        File fileDir = new File(dirPath);
        if (!fileDir.exists()) {
            fileDir.mkdir();
        }
    }

    /**
     * Saves the page's bytes to a local file under dirPath.
     */
    public static void saveToLocal(Page page) {
        mkdir();
        String fileName = getFileNameByUrl(page.getUrl(), page.getContentType());
        String filePath = dirPath + fileName;
        byte[] data = page.getContent();
        // try-with-resources closes the stream even if the write fails
        try (OutputStream out = new FileOutputStream(new File(filePath))) {
            out.write(data);
            out.flush();
            System.out.println("File " + fileName + " has been saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
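To see what kind of file names this naming rule produces, here is a standalone sketch. It makes two assumptions that go beyond the post: it strips both "http://" and "https://", and it drops any "; charset=..." parameter from the content type before using the subtype as the extension. The class name FileNameSketch and the example.com URL are hypothetical.

```java
// Illustrative only: a standalone variant of the file-naming rule, with the assumptions noted above.
class FileNameSketch {

    static String fileNameFor(String url, String contentType) {
        // Drop the scheme, then replace characters that are unsafe in file names.
        String name = url.replaceFirst("^https?://", "")
                .replaceAll("[\\?/:*|<>\"]", "_");
        if (contentType.contains("html")) {
            return name + ".html";
        }
        // Keep only the subtype, e.g. "application/pdf; q=1" -> "pdf".
        String subtype = contentType
                .substring(contentType.lastIndexOf('/') + 1)
                .split(";")[0]
                .trim();
        return name + "." + subtype;
    }

    public static void main(String[] args) {
        // www.cnblogs.com_joey44_.html
        System.out.println(fileNameFor("https://www.cnblogs.com/joey44/", "text/html; charset=utf-8"));
        // example.com_a.pdf_x=1.pdf
        System.out.println(fileNameFor("http://example.com/a.pdf?x=1", "application/pdf"));
    }
}
```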

For extensibility, add a LinkFilter interface:

package com.example.spiderman.link;

/**
 * @author yhw
 * @ClassName: LinkFilter
 * @Description:
 */
public interface LinkFilter {
    public boolean accept(String url);
}

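Overall project structure:

com.example.spiderman
    link
        LinkFilter
        Links
    page
        MyCrawler
        Page
        PageParserTool
        RequestAndResponseTool
    utils
        CharsetDetector
        FileTool
        RegexRule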