Pitfall Notes from Using a Web Crawler

A while back a project needed a simple "x宝" (Taobao) crawler. It was my first time using a crawler, so a lot of things were unclear and I mostly searched for code and copied it as-is. In the end, combining htmlunit with jsoup finally got me the data. The method looks roughly like this:
// Imports needed by the snippet (HtmlUnit 2.x, where the packages are still com.gargoylesoftware.*)
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Level;

import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.Cookie;

public static Document getTaobaoDetail(String url) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
    // Build a WebClient that simulates a Chrome browser
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    //WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER); // IE returns nothing
    //WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    //WebClient webClient = new WebClient(BrowserVersion.EDGE);
    // Suppress the log output
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    // Enable JavaScript: most e-commerce pages are loaded dynamically with JS, so JS support is required
    webClient.getOptions().setJavaScriptEnabled(true);
    // Set the home page
    webClient.getOptions().setHomePage("url"); // put the URL of the page you want to crawl here
    webClient.getBrowserVersion().setUserAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
    // Whether to enable CSS
    webClient.getOptions().setCssEnabled(false);
    // Native ActiveX
    webClient.getOptions().setActiveXNative(false);
    // Do not throw an exception when a script error occurs
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Do not throw an exception on a failing status code
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.addRequestHeader("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
    webClient.addRequestHeader("accept-encoding", "gzip, deflate, br");
    webClient.addRequestHeader("accept-language", "zh-CN,zh;q=0.9");
    webClient.addRequestHeader("Connection", "keep-alive");
    // webClient.addRequestHeader("referer", "copy it from the browser if needed; it was not needed here");
    // webClient.addRequestHeader("cookie", "cookie"); // setting cookies works roughly like this:
    String cookiestr = ""; // paste the raw Cookie header copied from the browser, e.g. "name1=value1; name2=value2"
    String[] cookies = cookiestr.split(";");
    for (int i = 0; i < cookies.length; i++) {
        String str = cookies[i].trim();
        if (!str.contains("=")) {
            continue; // skip empty or malformed entries
        }
        Cookie cookie = new Cookie("s.taobao.com", str.split("=", 2)[0], str.split("=", 2)[1]);
        webClient.getCookieManager().addCookie(cookie);
    }
    int status = webClient.getPage(url).getWebResponse().getStatusCode();
    System.out.println("status code returned by the page ===========" + status);
    if (status == 302) { // 302 seems to mean a redirect to a new page; getting it here means the data was not fetched
        Thread.sleep(2000);
    }
    Page page = webClient.getPage(url);
    URL redictUrl = page.getUrl();
    System.out.println("redirected URL ====" + redictUrl);
    System.out.println("page is ==============" + page.toString());
    HtmlPage rootPage = webClient.getPage(url);
    // Give the background JavaScript some time to run
    webClient.waitForBackgroundJavaScript(10000);
    try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    String html = rootPage.asXml();
    // Once the page is fetched, parse it with jsoup
    Document document = Jsoup.parse(html);
    webClient.close();
    return document;
}
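
For reference, here is a minimal usage sketch that is not part of the original post: it calls getTaobaoDetail and then runs a couple of jsoup selectors over the returned Document. The search URL and the selectors are placeholders; a real Taobao page needs selectors written against its actual markup.

public static void main(String[] args) throws Exception {
    // "q=test" is only an example query string, not taken from the original post
    Document doc = getTaobaoDetail("https://s.taobao.com/search?q=test");
    // jsoup selectors run against the DOM that htmlunit finished rendering
    System.out.println("page title = " + doc.select("title").text());
    for (org.jsoup.nodes.Element link : doc.select("a[href]")) {
        System.out.println(link.attr("href") + " -> " + link.text());
    }
}

Whether any of this works on a given day depends on Taobao's anti-bot checks; in practice the cookie string copied from a logged-in browser session is the part that most often needs refreshing.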

Original post: https://blog.csdn.net/plkiop911/article/details/86576483

Tags: web crawler