Simulating a Browser with HttpClient to Fetch Web Pages
程序员文章站
2022-05-29 22:49:52
1. Set the User-Agent request header to simulate a browser
2. Get the response content type (Content-Type)
3. Get the response status (Status)
1. Set the User-Agent request header to simulate a browser

Demo01.java

    package com.andrew.httpClient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo01 {
        public static void main(String[] args) throws Exception {
            // Create the HttpClient instance; try-with-resources closes it even if a call throws
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HTTP GET request
                // Simulate a browser
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
                // Execute the HTTP GET request; the response is closed automatically as well
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    HttpEntity entity = response.getEntity(); // get the response entity
                    System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
                }
            }
        }
    }
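To see what setting this header changes on the wire, the following sketch starts a throwaway local server that echoes back the User-Agent it received. It is only an illustration under stated assumptions: it uses the JDK's built-in `HttpServer` and `HttpURLConnection` rather than Apache HttpClient so that it runs without extra dependencies, and the class name `UserAgentEcho` and method `fetchEchoedUserAgent` are hypothetical names introduced here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class UserAgentEcho {

    /** Sends one GET with the given User-Agent to a local echo server and returns what the server saw. */
    static String fetchEchoedUserAgent(String userAgent) throws Exception {
        // Port 0 asks the operating system for any free port on the loopback interface.
        HttpServer server = HttpServer.create(
                new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0), 0);
        server.createContext("/", exchange -> {
            // Echo the User-Agent request header back as the response body.
            String seen = exchange.getRequestHeaders().getFirst("User-Agent");
            byte[] body = String.valueOf(seen).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            // NO_PROXY keeps the request on the loopback interface even if a system proxy is configured.
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setRequestProperty("User-Agent", userAgent); // same role as httpGet.setHeader(...)
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[1024];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            server.stop(0); // shut the throwaway server down immediately
        }
    }

    public static void main(String[] args) throws Exception {
        String ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0";
        System.out.println("Server saw User-Agent: " + fetchEchoedUserAgent(ua));
    }
}
```

A site that filters crawlers can only inspect what the client sends, so a request carrying a browser-style User-Agent looks, for this header at least, like one coming from that browser.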
2. Get the response content type (Content-Type)

Demo02.java

    package com.andrew.httpClient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class Demo02 {
        public static void main(String[] args) throws Exception {
            // Create the HttpClient instance; try-with-resources closes it even if a call throws
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                // Create the HTTP GET request
                HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar");
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
                // Execute the HTTP GET request; the response is closed automatically as well
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    HttpEntity entity = response.getEntity(); // get the response entity
                    System.out.println("Content-Type:" + entity.getContentType().getValue());
                    // Printing the body is left disabled here (it would also need the EntityUtils import):
                    // System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
                }
            }
        }
    }

Output:

    Content-Type:application/java-archive
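The value returned by `entity.getContentType().getValue()` is a single string that can carry both the media type and a `charset` parameter, for example `text/html;charset=UTF-8`. A crawler usually wants the two parts separately: the media type to decide whether the body is worth parsing, and the charset to decode it. The following is a minimal sketch in plain Java; the class `ContentTypeParts` and its methods are hypothetical helpers introduced for illustration, not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeParts {

    /** Returns the media type before the first ';', trimmed and lower-cased. */
    static String mimeType(String headerValue) {
        int semi = headerValue.indexOf(';');
        String type = semi < 0 ? headerValue : headerValue.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the charset parameter if present and supported, otherwise the fallback. */
    static Charset charset(String headerValue, Charset fallback) {
        String[] parts = headerValue.split(";");
        // parts[0] is the media type; the parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html;charset=UTF-8";
        System.out.println(mimeType(header));
        System.out.println(charset(header, StandardCharsets.ISO_8859_1));
    }
}
```

With the header from Demo02, `mimeType("application/java-archive")` is not a text type, which is a reasonable signal to skip printing the body as a string.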
3. Get the response status (Status)
200 OK; 403 Forbidden (access denied); 500 Internal Server Error; 404 Not Found (page not found)
Demo03.java

    package com.andrew.httpClient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class Demo03 {
        public static void main(String[] args) throws Exception {
            // Create the HttpClient instance; try-with-resources closes it even if a call throws
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                HttpGet httpGet = new HttpGet("http://www.open1111.com/"); // create the HTTP GET request
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
                // Execute the HTTP GET request; the response is closed automatically as well
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    System.out.println("Status:" + response.getStatusLine().getStatusCode());
                    HttpEntity entity = response.getEntity(); // get the response entity
                    System.out.println("Content-Type:" + entity.getContentType().getValue());
                    // Printing the body is left disabled here (it would also need the EntityUtils import):
                    // System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
                }
            }
        }
    }

Output:

    Status:200
    Content-Type:text/html;charset=UTF-8
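In a crawler, the integer returned by `response.getStatusLine().getStatusCode()` is typically checked before the body is read. The first digit of the code identifies its class, so a small helper can decide whether to continue. The sketch below is plain Java; `StatusClass`, `describe`, and `isSuccess` are hypothetical names introduced for illustration.

```java
class StatusClass {

    /** Maps an HTTP status code to the name of its class, based on the first digit. */
    static String describe(int code) {
        if (code < 100 || code > 599) {
            return "invalid";
        }
        switch (code / 100) {
            case 1:  return "informational";
            case 2:  return "success";
            case 3:  return "redirection";
            case 4:  return "client error";
            default: return "server error"; // only 5xx can reach this branch
        }
    }

    /** True for any 2xx code, the only class where the body is the requested page. */
    static boolean isSuccess(int code) {
        return code >= 200 && code < 300;
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```

Both 403 and 404 fall in the client error class, so a crawler that only needs to know whether to parse the page can rely on `isSuccess` and log the exact code for the rest.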