Java Web Scraping with HtmlUnit and HttpClient
2022-03-02 21:09:55
Blog link: Cs XJH's Blog
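I recently took over a project that needs to scrape data from web pages, so I studied the basics of web crawling.
Python is often called the go-to language for crawlers, but this project is written in Java, so I picked two tools from the Maven repository: HtmlUnit and HttpClient. Once I got familiar with them, they turned out to be powerful and easy to use as well.
First, the environment: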
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.2.0.RELEASE</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
The dependencies declare no version because the htmlunit and httpclient versions are inherited from the defaults managed by spring-boot-starter-parent.
Htmlunit
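Compared with HttpClient, HtmlUnit is the less familiar of the two. HtmlUnit is a GUI-less browser: it simulates browser operations and executes JavaScript in the background. It also offers three ways to query the parsed HTML: DOM, CSS selectors, and XPath.
HtmlUnit's main strength is that simulated login is very convenient: instead of constructing the form data yourself, you fill in the form fields directly. For sites where the server renders the HTML (front end and back end combined), it also makes parsing the pages straightforward.
Another well-known Java scraping tool is Jsoup. It is just as capable as HtmlUnit at parsing HTML, but for simulated login it requires you to construct the form data yourself. Login forms are often protected by security measures such as anti-CSRF tokens, and some of the form data can be quite troublesome to build. For this project, HtmlUnit was therefore the better fit.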
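To make "constructing the form data" concrete, here is a minimal JDK-only sketch of what a client has to assemble by hand when it cannot simply fill in the form in a simulated browser. The field names (`username`, `password`, `_csrf`) and their values are hypothetical; a real login page defines its own fields, and the hidden token usually has to be scraped from the page first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormDataSketch {

    // Encode the fields as an application/x-www-form-urlencoded request body.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical login fields, including a hidden token copied from the page.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "alice");
        fields.put("password", "p@ss word");
        fields.put("_csrf", "a1b2+c3");
        // prints username=alice&password=p%40ss+word&_csrf=a1b2%2Bc3
        System.out.println(encode(fields));
    }
}
```

With HtmlUnit none of this is needed, because the hidden fields are already part of the loaded page and are submitted along with the values you type in.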
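Configuration
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import org.springframework.context.annotation.Bean;

// Declared inside a Spring @Configuration class: registers the WebClient in the IoC container
@Bean
public WebClient getWebClient() {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    // Follow redirects
    webClient.getOptions().setRedirectEnabled(true);
    // Enable the JavaScript engine
    webClient.getOptions().setJavaScriptEnabled(true);
    // Disable CSS support
    webClient.getOptions().setCssEnabled(false);
    // Disable native ActiveX support
    webClient.getOptions().setActiveXNative(false);
    // Do not throw an exception when a script fails
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Do not throw an exception on a failing HTTP status code
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    // Set the Ajax controller, which enables Ajax support
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    // Terminate scripts that run longer than 10 seconds
    webClient.setJavaScriptTimeout(10 * 1000);
    return webClient;
}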
Usage
// Simulated login
public void login(WebClient webClient) throws BaseException {
    webClient.getCookieManager().clearCookies(); // clear cookies to start a fresh session
    String homeUrl = "";
    try {
        HtmlPage loginPage = webClient.getPage(loginUrl);
        // Give background JavaScript up to 1 second to finish
        webClient.waitForBackgroundJavaScript(1000);
        // Look up the form controls by element id
        HtmlTextInput nameInput = loginPage.getHtmlElementById(userNameElement);
        // Assumes a text-type input; an <input type="password"> is an HtmlPasswordInput instead
        HtmlTextInput pwdInput = loginPage.getHtmlElementById(userPwdElement);
        HtmlButton submit = loginPage.getHtmlElementById(submitElement);
        // Fill in the form directly, then submit it
        nameInput.setText(userName());
        pwdInput.setText(userPwd());
        HtmlPage nextPage = submit.click();
        homeUrl = nextPage.getBaseURL().toString();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Login succeeded only if we landed on the expected home page
    if (!homeUrl.equals(homeUrl())) {
        throw new BaseException(1002, "Incorrect username or password");
    }
}
// Parse an HTML table
HtmlPage coursePage = webClient.getPage(url);
HtmlTable courseTable = coursePage.getHtmlElementById("Table");
for (int i = 0, rLen = courseTable.getRowCount(); i < rLen; i++) {
    HtmlTableRow row = courseTable.getRow(i);
    for (int j = 0, cLen = row.getCells().size(); j < cLen; j++) {
        // Process each cell here, e.g. read its text with row.getCell(j).asText()
    }
}
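The same row-by-row, cell-by-cell traversal can also be expressed with XPath, one of the three query styles mentioned earlier. The sketch below uses the JDK's built-in `javax.xml.xpath` API on a well-formed XHTML string, not HtmlUnit itself, so it runs without extra dependencies. The table content is made up for illustration, and real-world HTML is often not well-formed XML, which is one reason to let HtmlUnit do the parsing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class TableXPathSketch {

    // Return the text of every cell of the table with the given id, one inner list per row.
    static List<List<String>> readTable(String xhtml, String tableId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select all rows of the target table.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//table[@id='" + tableId + "']//tr", doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Select the header and data cells that are direct children of this row.
                NodeList cells = (NodeList) xpath.evaluate(
                        "th|td", rows.item(i), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    row.add(cells.item(j).getTextContent().trim());
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample table; the id matches the one used in the snippet above.
        String xhtml = "<html><body><table id=\"Table\">"
                + "<tr><th>Course</th><th>Credits</th></tr>"
                + "<tr><td>Algorithms</td><td>4</td></tr>"
                + "</table></body></html>";
        System.out.println(readTable(xhtml, "Table"));
    }
}
```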
HttpClient
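HttpClient is a tool for sending HTTP requests programmatically, and it is commonly used to call RESTful APIs and read their responses. It is not suited to parsing HTML, so on its own it only fits scraping sites whose front end and back end are separated.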
Configuration
// Cookie store used by the HttpClient
@Bean
public CookieStore getCookieStore() {
    return new BasicCookieStore();
}

@Bean(name = "httpClient")
public CloseableHttpClient getHttpClient(CookieStore cookieStore) {
    return HttpClients.custom()
            .setDefaultCookieStore(cookieStore)
            .build();
}

// Build an HttpClient for HTTPS requests.
// Note: this trust manager accepts every certificate and hostname verification is disabled,
// so the server's identity is not checked.
@Bean(name = "httpsClient")
public HttpClient getHttpsClient() {
    SSLConnectionSocketFactory sslsf = null;
    try {
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(null, new TrustManager[] {
            new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                }
                @Override
                public void checkServerTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                }
                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    // The interface contract requires a non-null (possibly empty) array
                    return new X509Certificate[0];
                }
            }
        }, null);
        sslsf = new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE);
    } catch (NoSuchAlgorithmException | KeyManagementException e) {
        e.printStackTrace();
    }
    return HttpClients.custom().setSSLSocketFactory(sslsf)
            .setMaxConnTotal(50)
            .setMaxConnPerRoute(50)
            .setDefaultRequestConfig(RequestConfig.custom()
                    .setConnectionRequestTimeout(60000)
                    .setConnectTimeout(60000)
                    .setSocketTimeout(60000)
                    .build())
            .build();
}
Usage
// Put the session cookie into the cookie store so that later requests carry it
BasicClientCookie cookie = new BasicClientCookie(sessionName, session.getValue());
cookie.setDomain(session.getDomain());
cookie.setExpiryDate(session.getExpires());
cookie.setPath(session.getPath());
cookieStore.addCookie(cookie);
CloseableHttpResponse response = null;
try {
    // The default Content-Type is application/x-www-form-urlencoded
    UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
    httpPost.setEntity(formEntity);
    // Send the POST request
    response = httpClient.execute(httpPost);
    // Read the body, falling back to UTF-8 when the response declares no charset
    String json = EntityUtils.toString(response.getEntity(), "UTF-8");
    // Release the entity
    EntityUtils.consume(response.getEntity());
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // response is still null if execute() threw, so guard against a NullPointerException
    if (response != null) {
        try {
            response.close(); // close the response
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
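The first lines of the snippet above hand a session cookie to the cookie store so that subsequent requests are sent as the logged-in user. As a rough illustration of what a cookie store does with such a cookie, here is a sketch using the JDK's own `java.net.CookieManager` and `HttpCookie` rather than Apache HttpClient's `BasicCookieStore`. The cookie name `JSESSIONID`, its value, and the host names are hypothetical.

```java
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

class CookieStoreSketch {

    // Create a cookie manager that already holds one session cookie.
    static CookieManager withSession(String name, String value, String domain, String path) {
        CookieManager manager = new CookieManager();
        HttpCookie cookie = new HttpCookie(name, value);
        cookie.setDomain(domain);
        cookie.setPath(path);
        cookie.setMaxAge(3600); // keep the cookie for one hour
        // A null URI means the cookie is indexed by its domain only.
        manager.getCookieStore().add(null, cookie);
        return manager;
    }

    // Cookies the store considers applicable to the host of the given URL.
    static List<HttpCookie> cookiesFor(CookieManager manager, String url) {
        return manager.getCookieStore().get(URI.create(url));
    }

    public static void main(String[] args) {
        CookieManager manager = withSession("JSESSIONID", "abc123", "example.com", "/");
        // The session cookie is offered for its own domain...
        System.out.println(cookiesFor(manager, "http://example.com/api/data").size());
        // ...but not for an unrelated host.
        System.out.println(cookiesFor(manager, "http://other.org/").size());
    }
}
```

This is why copying the domain and path along with the value matters: the store uses them to decide which requests the session cookie accompanies.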