
[Bonus] A Java crawler with jsoup and HtmlUnit


This crawler is mainly a practice exercise for jsoup and HtmlUnit beginners.

Target site: 1024 Meitu, http://www.1024meitu.com

Overall design: start from the homepage and collect every article by walking the pagination.
Detailed design:
pom.xml

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.27</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>

The current approach: walk the pagination to collect every page, then every article on each page, then every image in each article.
0. Initialization

static WebClient webClient = getWebClient();
static String base_url = "http://www.1024meitu.com"; // crawl entry point (used below)
static String filepath = "E:\\test\\mt\\";

public static WebClient getWebClient() {
    final WebClient webclient = new WebClient(BrowserVersion.FIREFOX_52);
    // Wait for AJAX requests to finish so dynamically loaded content is present
    webclient.setAjaxController(new NicelyResynchronizingAjaxController());
    webclient.getOptions().setCssEnabled(false); // skip CSS processing for speed
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setRedirectEnabled(false);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.setCookieManager(new CookieManager());
    webclient.setJavaScriptTimeout(5000);
    webclient.getOptions().setTimeout(100000);
    return webclient;
}
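
As a quick sanity check, here is a minimal sketch of using this client to fetch the entry page and print its title (assuming the site is reachable):

// Minimal smoke test for the client above (assumes the site is reachable)
HtmlPage home = webClient.getPage(base_url);
System.out.println(home.getTitleText()); // prints the page <title>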

1. Collect every page from the pagination bar

public static List<String> getAllPage(HtmlPage page) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    List<String> pages = new ArrayList<>();
    List<HtmlAnchor> anchors = page.getAnchors();
    for (HtmlAnchor htmlAnchor : anchors) {
        // Pagination links are the anchors carrying class="page-number"
        String attribute = htmlAnchor.getAttribute("class");
        if ("page-number".equals(attribute)) {
            pages.add(htmlAnchor.getHrefAttribute());
        }
    }
    return pages;
}
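
One caveat: getHrefAttribute() returns the raw href value, which may be relative. If the site emits relative links, they can be resolved against the current page before being collected, for example:

// Resolve a possibly-relative href against the page URL before collecting it
pages.add(page.getFullyQualifiedUrl(htmlAnchor.getHrefAttribute()).toString());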

2. Collect every article

public static List<String> getAllArticles() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    HtmlPage basepage = webClient.getPage(base_url);
    // Articles on the first page, then on every page listed in the pagination bar
    List<String> articles = parsePage(basepage);
    List<String> pages = getAllPage(basepage);
    for (String page : pages) {
        articles.addAll(parsePage((HtmlPage) webClient.getPage(page)));
    }
    return articles;
}

public static List<String> parsePage(HtmlPage page) {
    List<String> articles = new ArrayList<>();
    List<HtmlAnchor> anchors = page.getAnchors();
    for (HtmlAnchor htmlAnchor : anchors) {
        // Article links are the anchors marked rel="bookmark"
        String attribute = htmlAnchor.getAttribute("rel");
        if ("bookmark".equals(attribute)) {
            articles.add(htmlAnchor.getHrefAttribute());
        }
    }
    return articles;
}
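
Note that a post card may carry more than one rel="bookmark" anchor (for example a thumbnail link plus a title link), in which case this list will contain duplicates. An order-preserving dedup is a one-liner:

// Deduplicate article URLs while preserving their order (java.util.LinkedHashSet)
List<String> unique = new ArrayList<>(new LinkedHashSet<>(articles));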

3. Collect every image

public static Map<String, List<String>> getImgs(String article) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    Map<String, List<String>> map = new HashMap<>();
    HtmlPage basepage = webClient.getPage(article);
    // Let HtmlUnit render the page, then hand the markup to jsoup for parsing
    Document doc = Jsoup.parse(basepage.asXml());
    String title = doc.title();
    List<String> imgs = new ArrayList<>();
    // The post body sits in the element with class "content-reset"
    Element element = doc.select(".content-reset").first();
    if (element != null) { // guard against articles without a post body
        for (Element img : element.getElementsByTag("img")) {
            imgs.add(img.attr("src"));
        }
    }
    map.put(title, imgs);
    return map;
}
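
If the img src attributes turn out to be relative, jsoup can resolve them, but only when the document is parsed with a base URI. A sketch of that variant (same logic, two changed lines):

// Pass the article URL as base URI so relative srcs can be resolved
Document doc = Jsoup.parse(basepage.asXml(), article);
for (Element img : doc.select(".content-reset img")) {
    imgs.add(img.absUrl("src")); // absUrl() resolves against the base URI
}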

4. Persistence: I simply save to a text file

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    System.out.println(new Date());
    // Maps each article title to its list of image URLs
    Map<String, List<String>> map = new HashMap<>();
    List<String> list = getAllArticles();
    for (String article : list) {
        map.putAll(getImgs(article));
    }
    // One text file per article, named after its title, one image URL per line
    for (String title : map.keySet()) {
        List<String> imgs = map.get(title);
        for (String img : imgs) {
            write(filepath + title, img);
        }
    }
    System.out.println(new Date());
}

public static void write(String file, String content) throws IOException {
    File file2 = new File(file);
    if (!file2.exists()) {
        file2.createNewFile();
    }
    // Append one line; try-with-resources closes the writer even on failure
    try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file, true)))) {
        out.write(content + "\r\n");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
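
A text file of URLs is enough for this exercise, but if you want the image files themselves, here is a minimal download sketch using only java.net and java.nio (the saveImage helper name and the .jpg extension are my own assumptions, not part of the original code):

// Hypothetical helper: download one image URL to disk.
// Article titles can contain characters that are illegal in Windows file
// names (\ / : * ? " < > |), so they are sanitized before building the path.
public static void saveImage(String imgUrl, String dir, String title, int index) throws IOException {
    String safe = title.replaceAll("[\\\\/:*?\"<>|]", "_");
    try (InputStream in = new URL(imgUrl).openStream()) {
        Files.copy(in, Paths.get(dir, safe + "-" + index + ".jpg"),
                StandardCopyOption.REPLACE_EXISTING);
    }
}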

At this point we have every article title together with its corresponding image URLs.

The only mildly tricky part is extracting the content you want from the HTML page. It mostly comes down to attributes such as id, name, and class, plus the various tags; jsoup and HtmlUnit both provide very convenient methods for this, so practice with them.
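
For example, jsoup's selector API covers all of these lookups in one or two calls (the selectors below are illustrative, not taken from the target site):

Document doc = Jsoup.parse(html);
Element byId    = doc.getElementById("main");       // by id
Elements byCls  = doc.select(".content-reset");     // by class
Elements byTag  = doc.getElementsByTag("img");      // by tag
Elements byAttr = doc.select("a[rel=bookmark]");    // by attribute value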
