
Java Web Scraping with Htmlunit and HttpClient

程序员文章站 2022-03-02 21:09:55

Blog link: Cs XJH’s Blog

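I recently took over a project that needs to scrape data from web pages, so I spent some time learning about crawlers.
Python is often called the go-to tool for crawlers, but this project is written in Java, so I picked two tools from the Maven repository: Htmlunit and HttpClient. After getting familiar with them, I found that they are also powerful and easy to use.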

First, the environment:

<parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.2.0.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
</parent>
<dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId> 
</dependency>
<dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
</dependency>

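The htmlunit and httpclient versions are inherited from the defaults managed by spring-boot-starter-parent, which is why the dependencies above declare no version.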

Htmlunit

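Compared with httpclient, htmlunit is the less familiar of the two. htmlunit is a tool that simulates browser operations without a visible window, and it can execute JavaScript in the background. It also supports three ways of parsing HTML: DOM, CSS selectors, and XPath.

The advantage of htmlunit is that simulating a login is very convenient: instead of constructing the form data yourself, you fill in the form fields directly. For web projects where the front end and back end are combined (the server returns rendered HTML), it also makes parsing that HTML very convenient.

Another well-known tool for Java crawlers is Jsoup. It is just as capable as htmlunit at parsing HTML, but to simulate a login it requires you to construct the form data. Login flows often come with security protections such as XSS defenses, and some form data is quite tedious to construct by hand. So for this project, htmlunit is the better fit.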

Configuration

// Register the WebClient in the IoC container
@Bean
public WebClient getWebClient() {
	WebClient webClient = new WebClient(BrowserVersion.CHROME);
	// Allow redirects
	webClient.getOptions().setRedirectEnabled(true);
	// Enable the JS interpreter
	webClient.getOptions().setJavaScriptEnabled(true);
	// Disable CSS support
	webClient.getOptions().setCssEnabled(false);
	// Do not use native ActiveX (used for animation, video and the like)
	webClient.getOptions().setActiveXNative(false);
	// Do not throw an exception when a script fails
	webClient.getOptions().setThrowExceptionOnScriptError(false);
	// Do not throw an exception on a failing status code
	webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
	// Set the Ajax controller, i.e. enable Ajax support
	webClient.setAjaxController(new NicelyResynchronizingAjaxController());
	// JavaScript timeout: 10 seconds
	webClient.setJavaScriptTimeout(10 * 1000);
	return webClient;
}

Usage

// Simulate a login
public void login(WebClient webClient) throws BaseException {
	webClient.getCookieManager().clearCookies();  // clear cookies
	
	String homeUrl = "";
	try {
		HtmlPage loginPage = webClient.getPage(loginUrl);
		// Wait up to 1 second for background JavaScript to finish
		webClient.waitForBackgroundJavaScript(1000);

		HtmlTextInput nameInput = loginPage.getHtmlElementById(userNameElement);
		// Note: an <input type="password"> is an HtmlPasswordInput, not an HtmlTextInput
		HtmlTextInput pwdInput = loginPage.getHtmlElementById(userPwdElement);
		HtmlButton submit = loginPage.getHtmlElementById(submitElement);

		// Fill in the form fields directly instead of building form data
		nameInput.setText(userName());
		pwdInput.setText(userPwd());

		HtmlPage nextPage = submit.click();
		homeUrl = nextPage.getBaseURL().toString();
	}
	catch (IOException e) {
		e.printStackTrace();
	}

	// The login is treated as failed unless we land on the home page
	if (!homeUrl.equals(homeUrl())) {
		throw new BaseException(1002, "Incorrect username or password");
	}
}


// Parse an HTML table
HtmlPage coursePage = webClient.getPage(url);
HtmlTable courseTable = coursePage.getHtmlElementById("Table");

for (int i = 0, rLen = courseTable.getRowCount(); i < rLen; i++) {
	HtmlTableRow row = courseTable.getRow(i);

	for (int j = 0, cLen = row.getCells().size(); j < cLen; j++) {
		// Process each cell here, e.g. via row.getCell(j)
	}
}

HttpClient

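httpclient is a tool for sending HTTP requests from code, and it is commonly used to consume the responses of RESTful APIs. It is not suited to parsing HTML, so httpclient is only appropriate for scraping data from sites whose front end and back end are separated.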

Configuration

// Cookie store used by httpclient
@Bean
public CookieStore getCookieStore() {
	return new BasicCookieStore();
}

@Bean(name = "httpClient")
public CloseableHttpClient getHttpClient(CookieStore cookieStore) {
	return HttpClients.custom()
	                .setDefaultCookieStore(cookieStore)
	                .build();
}

// Build an httpclient that supports https requests.
// Note: this trust manager accepts every certificate, and hostname verification is disabled.
@Bean(name = "httpsClient")
public HttpClient getHttpsClient() {
	SSLConnectionSocketFactory sslsf = null;
	try {
		SSLContext sslContext = SSLContext.getInstance("TLS");
		sslContext.init(null, new TrustManager[] {
			new X509TrustManager() {
				@Override
				public void checkClientTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
				}

				@Override
				public void checkServerTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
				}

				@Override
				public X509Certificate[] getAcceptedIssuers() {
					// The interface contract asks for a non-null (possibly empty) array
					return new X509Certificate[0];
				}
			}
		}, null);

		sslsf = new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE);
	}catch (NoSuchAlgorithmException | KeyManagementException e) {
		e.printStackTrace();
	}

	// All timeouts are in milliseconds
	return HttpClients.custom().setSSLSocketFactory(sslsf)
	                .setMaxConnTotal(50)
	                .setMaxConnPerRoute(50)
	                .setDefaultRequestConfig(RequestConfig.custom()
	                        .setConnectionRequestTimeout(60000)
	                        .setConnectTimeout(60000)
	                        .setSocketTimeout(60000)
	                        .build())
	                .build();
}
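
The httpsClient bean above accepts any server certificate, because the check methods of its trust manager are empty. For sites that present a valid certificate, a stricter context may be enough. The following is a minimal JDK-only sketch of building an SSLContext backed by the JDK's default trust store. The class name DefaultTrustSslContext is hypothetical and not part of the original project.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, shown only as an illustration.
public class DefaultTrustSslContext {

    // Builds a TLS context that validates server certificates
    // against the JDK's default trust store.
    public static SSLContext create() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null selects the JDK default trust store
            tmf.init((KeyStore) null);

            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, tmf.getTrustManagers(), null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not initialise the default TLS context", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create().getProtocol());
    }
}
```

Under these assumptions, the resulting context could be passed to new SSLConnectionSocketFactory(sslContext) in place of the trust-all one, which also restores the default hostname verification.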

Usage

// Copy the session cookie into the httpclient cookie store
BasicClientCookie cookie = new BasicClientCookie(sessionName, session.getValue());

cookie.setDomain(session.getDomain());
cookie.setExpiryDate(session.getExpires());
cookie.setPath(session.getPath());

cookieStore.addCookie(cookie);

CloseableHttpResponse response = null;
try {
	// The default Content-Type is application/x-www-form-urlencoded
	UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
	httpPost.setEntity(formEntity);

	// Send the POST request
	response = httpClient.execute(httpPost);
	// Decode as UTF-8 when the response does not declare a charset
	String json = EntityUtils.toString(response.getEntity(), "UTF-8");

	// Make sure the entity is fully consumed
	EntityUtils.consume(response.getEntity());
}catch (IOException e) {
	e.printStackTrace();
}finally {
	// response stays null if execute() threw, so guard against a NullPointerException
	if (response != null) {
		try {
			response.close(); // close the response
		}catch (IOException e) {
			e.printStackTrace();
		}
	}
}
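
To make the request body concrete: UrlEncodedFormEntity sends the parameters as an application/x-www-form-urlencoded string. The following is a JDK-only sketch of that encoding, not HttpClient's own implementation. The class name FormBodySketch and the sample parameters are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, shown only as an illustration.
public class FormBodySketch {

    // Joins key=value pairs with '&', URL-encoding keys and values as UTF-8.
    public static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is predictable
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user", "alice");
        params.put("q", "a b&c");
        System.out.println(encode(params)); // prints user=alice&q=a+b%26c
    }
}
```

This is the kind of string you would have to assemble by hand, including any hidden fields, when a tool needs the form data constructed explicitly.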