[Bonus] A Java crawler with jsoup and htmlunit
This crawler is mainly a practice exercise for newcomers to jsoup and htmlunit.
Target site: 1024meitu, http://www.1024meitu.com
Overall design: start from the home page and follow the pagination links to reach every article.
Detailed design:
pom.xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.27</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
The current approach: walk the pagination to get every page, collect every article from those pages, then pull every image out of each article.
0. Initialization
static WebClient webClient = getWebClient();
static String base_url = "http://www.1024meitu.com";   // crawl entry point (the home page; getAllArticles() below relies on it)
static String filepath = "E:\\test\\mt\\";              // where the result text files go
public static WebClient getWebClient() {
    final WebClient webclient = new WebClient(BrowserVersion.FIREFOX_52);
    // let HtmlUnit wait for AJAX requests triggered by the page's scripts
    webclient.setAjaxController(new NicelyResynchronizingAjaxController());
    webclient.getOptions().setCssEnabled(false);                    // we only need the DOM, skip CSS
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setRedirectEnabled(false);
    webclient.getOptions().setThrowExceptionOnScriptError(false);   // ignore broken page scripts
    webclient.setCookieManager(new CookieManager());
    webclient.setJavaScriptTimeout(5000);
    webclient.getOptions().setTimeout(100000);
    return webclient;
}
1. Get every page from the pagination bar
public static List<String> getAllPage(HtmlPage page) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    List<String> pages = new ArrayList<>();
    // the pagination bar renders each page link as an anchor with class="page-number"
    List<HtmlAnchor> anchors = page.getAnchors();
    for (HtmlAnchor htmlAnchor : anchors) {
        String attribute = htmlAnchor.getAttribute("class");
        if ("page-number".equals(attribute)) {
            pages.add(htmlAnchor.getHrefAttribute());
        }
    }
    return pages;
}
2. Get every article
public static List<String> getAllArticles() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    HtmlPage basepage = webClient.getPage(base_url);
    // articles linked on the home page itself
    List<String> articles = parsePage(basepage);
    // plus the articles on every other page reachable from the pagination bar
    List<String> pages = getAllPage(basepage);
    for (String page : pages) {
        articles.addAll(parsePage((HtmlPage) webClient.getPage(page)));
    }
    return articles;
}
public static List<String> parsePage(HtmlPage page) {
    List<String> articles = new ArrayList<>();
    // each article link on a listing page carries rel="bookmark"
    List<HtmlAnchor> anchors = page.getAnchors();
    for (HtmlAnchor htmlAnchor : anchors) {
        String attribute = htmlAnchor.getAttribute("rel");
        if ("bookmark".equals(attribute)) {
            articles.add(htmlAnchor.getHrefAttribute());
        }
    }
    return articles;
}
3. Get every image
public static Map<String, List<String>> getImgs(String article) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    Map<String, List<String>> map = new HashMap<>();
    HtmlPage basepage = webClient.getPage(article);
    // hand the rendered page over to jsoup for the actual extraction
    Document doc = Jsoup.parse(basepage.asXml());
    String title = doc.title();
    List<String> imgs = new ArrayList<>();
    // the article body sits in the element with class "content-reset"
    Element element = doc.select(".content-reset").get(0);
    Elements elements = element.getElementsByTag("img");
    for (Element img : elements) {
        imgs.add(img.attr("src"));
    }
    map.put(title, imgs);
    return map;
}
4. Persistence: I simply save the results to text files
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    System.out.println(new Date());
    Map<String, List<String>> map = new HashMap<>();
    List<String> articles = getAllArticles();
    for (String article : articles) {
        map.putAll(getImgs(article));
    }
    // one text file per article, named after the article title, one image URL per line
    for (String title : map.keySet()) {
        for (String img : map.get(title)) {
            write(filepath + title, img);
        }
    }
    System.out.println(new Date());
}
public static void write(String file, String content) throws IOException {
    File target = new File(file);
    if (!target.exists()) {
        target.createNewFile();
    }
    // open in append mode and add one line per call; try-with-resources closes the stream safely
    try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file, true)))) {
        out.write(content + "\r\n");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
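The text files above only record the image URLs. If you also want the pictures themselves on disk, a minimal sketch could look like the hypothetical helper below; it is not part of the original code and assumes the src values are absolute, freely downloadable URLs that need no cookies or referer header.
// Hypothetical helper, not part of the original crawler: stream one image URL to a local file.
// Assumes imgUrl is an absolute, directly downloadable URL.
// Needs java.io.InputStream, java.io.FileOutputStream and java.net.URL in addition to the imports above.
public static void download(String imgUrl, String target) {
    try (InputStream in = new URL(imgUrl).openStream();
         FileOutputStream out = new FileOutputStream(target)) {
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);   // copy the stream in 8 KB chunks
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}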
At this point we have every article title together with its image URLs.
The only mildly tricky part is extracting the content we want from the HTML page. That mostly comes down to attributes such as id, name and class, plus the various tags, and both jsoup and htmlunit offer very convenient methods for this, so get plenty of practice with them.
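For example, the jsoup calls used above boil down to a few selector lookups. A minimal illustrative sketch, reusing the class names seen on this site and assuming html holds page source you have already fetched as a String:
Document doc = Jsoup.parse(html);                  // parse the raw page source
String title = doc.title();                        // text of the <title> tag
Elements pageLinks = doc.select("a.page-number");  // pagination anchors, selected by class
Elements imgs = doc.select(".content-reset img");  // images inside the article body
for (Element img : imgs) {
    System.out.println(img.attr("src"));           // the attribute we actually want
}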