Using HtmlUnit
程序员文章站
2022-05-05 14:48:24
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.26</version>
</dependency>
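Simulating a browser request with HtmlUnit
WebClient has three constructors: the second lets you specify the browser version, and the third lets you specify a proxy server as well.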
WebClient(): Creates a web client instance using the browser version returned by BrowserVersion.getDefault().
WebClient(BrowserVersion browserVersion): Creates a web client instance using the specified BrowserVersion.
WebClient(BrowserVersion browserVersion, String proxyHost, int proxyPort): Creates an instance that will use the specified BrowserVersion and proxy server.
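import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitDemo {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52); // create the web client
        // The URL must include the protocol; a bare "www.baidu.com" throws MalformedURLException.
        HtmlPage page = webClient.getPage("http://www.baidu.com"); // fetch and parse the page
        System.out.println("Page HTML: " + page.asXml());
        System.out.println("==================================");
        System.out.println("Page text: " + page.asText());
        webClient.close(); // close the client to release memory
    }
}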
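The third constructor is the one not exercised above, so here is a minimal sketch of it. The proxy address 127.0.0.1:8080 and the class and method names are placeholders chosen for illustration, not values from the original post, and BrowserVersion.CHROME is used only because it is available across the 2.x releases. No request is sent; the sketch only reads the proxy settings back from the client options.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;

public class ConstructorSketch {

    // Hypothetical proxy address, for illustration only.
    static final String PROXY_HOST = "127.0.0.1";
    static final int PROXY_PORT = 8080;

    // Third constructor: browser version plus proxy host and port.
    static WebClient newProxiedClient() {
        return new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT);
    }

    public static void main(String[] args) {
        // WebClient is AutoCloseable, so try-with-resources closes both clients.
        try (WebClient defaultClient = new WebClient();
             WebClient proxiedClient = newProxiedClient()) {
            // First constructor: the version comes from BrowserVersion.getDefault().
            System.out.println("Default browser: "
                    + defaultClient.getBrowserVersion().getNickname());

            // Read back the proxy that the third constructor stored in the options.
            ProxyConfig proxy = proxiedClient.getOptions().getProxyConfig();
            System.out.println("Proxy: " + proxy.getProxyHost() + ":" + proxy.getProxyPort());
        }
    }
}
```

Using try-with-resources instead of an explicit close() call means the client is released even if a later statement throws.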
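Getting specific elements with HtmlUnit
The list below, taken from the official documentation, shows the get methods of HtmlPage, which make it easy to fetch specific elements.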
HtmlAnchor getAnchorByHref(String href): Returns the HtmlAnchor with the specified href.
HtmlAnchor getAnchorByName(String name): Returns the HtmlAnchor with the specified name.
HtmlAnchor getAnchorByText(String text): Returns the first anchor with the specified text.
List<HtmlAnchor> getAnchors(): Returns a list of all anchors contained in this page.
URL getBaseURL(): The base URL used to resolve relative URLs.
HtmlElement getBody(): Returns the body element (or frameset element), or null if it does not yet exist.
Charset getCharset(): Returns the encoding.
String getContentType(): Returns the content type of this page.
HtmlElement getDocumentElement(): Returns the document element.
String getDocumentURI(): Not yet implemented.
DOMConfiguration getDomConfig(): Not yet implemented.
DomElement getElementById(String elementId)
<E extends DomElement> E getElementByName(String name): Returns the element with the specified name.
HtmlElement getElementFromPoint(int x, int y): INTERNAL API - SUBJECT TO CHANGE AT ANY TIME - USE AT YOUR OWN RISK.
List<DomElement> getElementsById(String elementId): Returns the elements with the specified ID.
List<DomElement> getElementsByIdAndOrName(String idAndOrName): Returns the elements with the specified string for their name or ID.
List<DomElement> getElementsByName(String name): Returns the elements with the specified name attribute.
DomElement getFocusedElement(): Returns the element with the focus or null if no element has the focus.
HtmlForm getFormByName(String name): Returns the first form that matches the specified name.
List<HtmlForm> getForms(): Returns a list of all the forms in this page.
FrameWindow getFrameByName(String name): Returns the first frame contained in this page with the specified name.
List<FrameWindow> getFrames(): Returns a list containing all the frames (from frame and iframe tags) in this page.
URL getFullyQualifiedUrl(String relativeUrl): Given a relative URL (ie /foo), returns a fully-qualified URL based on the URL that was used to load this page.
HtmlElement getHead(): Returns the head element.
HtmlElement getHtmlElementByAccessKey(char accessKey): Returns the HTML element that is assigned to the specified access key.
<E extends HtmlElement> E getHtmlElementById(String elementId): Returns the HTML element with the specified ID.
List<HtmlElement> getHtmlElementsByAccessKey(char accessKey): Returns all the HTML elements that are assigned to the specified access key.
DOMImplementation getImplementation(): Not yet implemented.
String getInputEncoding(): Not yet implemented.
protected List<HtmlMeta> getMetaTags(String httpEquiv): Gets the meta tag for a given http-equiv value.
Map<String,String> getNamespaces(): Returns all namespaces defined in the root element of this page.
Document getOwnerDocument()
HtmlPage getPage(): Returns the page that contains this node.
String getResolvedTarget(String elementTarget): Given a target attribute value, resolve the target using a base target for the page.
List<org.w3c.dom.ranges.Range> getSelectionRanges(): INTERNAL API - SUBJECT TO CHANGE AT ANY TIME - USE AT YOUR OWN RISK.
boolean getStrictErrorChecking(): Not yet implemented.
List<String> getTabbableElementIds(): Returns a list of ids (strings) that correspond to the tabbable elements in this page.
List<HtmlElement> getTabbableElements(): Returns a list of all elements that are tabbable in the order that will be used for tabbing.
String getTitleText(): Returns the title of this page or an empty string if the title wasn't specified.
String getXmlEncoding()
boolean getXmlStandalone()
String getXmlVersion()
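To make a few of these getters concrete, here is a minimal sketch that calls getTitleText, getElementById, getAnchors and getAnchorByText. So that it runs without network access, it serves a small hand-written page through HtmlUnit's MockWebConnection; the sample HTML, the URL http://example.test/ and the class and helper names are assumptions made for illustration and do not come from the original post.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayList;
import java.util.List;

public class ElementLookupSketch {

    // Hypothetical sample page, served from memory instead of a real site.
    static final String SAMPLE_HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id='main'>Hello</div>"
            + "<a href='/a.html'>First</a>"
            + "<a href='/b.html'>Second</a>"
            + "</body></html>";

    // Loads the sample page through a mock connection, so no real HTTP request is made.
    static HtmlPage loadSamplePage(WebClient webClient) throws Exception {
        MockWebConnection connection = new MockWebConnection();
        connection.setDefaultResponse(SAMPLE_HTML); // every URL returns the sample page
        webClient.setWebConnection(connection);
        return webClient.getPage("http://example.test/");
    }

    // Collects the raw href attribute of every anchor on the page.
    static List<String> anchorHrefs(HtmlPage page) {
        List<String> hrefs = new ArrayList<>();
        for (HtmlAnchor anchor : page.getAnchors()) {
            hrefs.add(anchor.getHrefAttribute());
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = loadSamplePage(webClient);

            System.out.println("Title: " + page.getTitleText());

            DomElement main = page.getElementById("main");
            System.out.println("Text of #main: " + main.getTextContent());

            System.out.println("All hrefs: " + anchorHrefs(page));

            // getAnchorByText throws ElementNotFoundException when nothing matches.
            HtmlAnchor second = page.getAnchorByText("Second");
            System.out.println("Second link: " + second.getHrefAttribute());
        }
    }
}
```

Against a real site you would drop the mock connection and pass the site's URL to getPage; the lookup calls stay the same.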
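Disabling CSS and JavaScript support in HtmlUnit
CSS is rarely needed when you only want a page's content.
JavaScript is often needed, but if your target pages work without it, disable it as well. Turning both off makes page loading faster.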
webClient.getOptions().setCssEnabled(false); // disable CSS support
webClient.getOptions().setJavaScriptEnabled(false); // disable JavaScript support
HtmlPage page = webClient.getPage("url"); // fetch and parse the page
HtmlForm form = page.getFormByName("formName"); // get the search form
HtmlTextInput textField = form.getInputByName("q"); // get the query text box
HtmlSubmitInput button = form.getInputByName("submitButton"); // get the submit button
textField.setValueAttribute("java"); // fill in the text box
HtmlPage resultPage = button.click(); // simulate a click and get the result page
System.out.println(resultPage.asXml());
webClient.close();
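The same form flow can be tried end to end without a real site. The sketch below is a self-contained variant under stated assumptions: the form HTML, the URL http://example.test/ and the class and method names are made up for illustration, and the page is served by HtmlUnit's MockWebConnection rather than fetched over the network. CSS and JavaScript are disabled as described above.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class FormSubmitSketch {

    // Hypothetical search form; the element names mirror the snippet above.
    static final String FORM_HTML =
            "<html><body><form name='formName' action='/search' method='get'>"
            + "<input type='text' name='q'>"
            + "<input type='submit' name='submitButton' value='Go'>"
            + "</form></body></html>";

    // Fills in the query, clicks submit, and returns the URL the form was submitted to.
    static String submitQuery(String query) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setCssEnabled(false);        // disable CSS support
            webClient.getOptions().setJavaScriptEnabled(false); // disable JavaScript support

            // Serve the form from memory; every requested URL gets the same page.
            MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(FORM_HTML);
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage("http://example.test/");
            HtmlForm form = page.getFormByName("formName");
            HtmlTextInput textField = form.getInputByName("q");
            HtmlSubmitInput button = form.getInputByName("submitButton");

            textField.setValueAttribute(query);   // fill in the text box
            HtmlPage resultPage = button.click(); // submit the form with a simulated click
            return resultPage.getUrl().toString();
        }
    }

    public static void main(String[] args) throws Exception {
        // The query is sent as a GET parameter, so it appears in the result URL.
        System.out.println(submitQuery("java"));
    }
}
```

Returning the result URL keeps the sketch easy to check: because the form uses GET, the submitted query shows up as the q parameter.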