
Using HtmlUnit

程序员文章站 2022-05-05 14:48:24
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.26</version>
</dependency>
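Simulating a browser request with HtmlUnit
WebClient has three constructors: the second lets you specify the browser version, and the third lets you specify a proxy server as well.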
WebClient()
Creates a web client instance using the browser version returned by BrowserVersion.getDefault().
WebClient(BrowserVersion browserVersion)
Creates a web client instance using the specified BrowserVersion.
WebClient(BrowserVersion browserVersion, String proxyHost, int proxyPort)
Creates an instance that will use the specified BrowserVersion and proxy server.

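import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitDemo {
  public static void main(String[] args) throws Exception {
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52); // create the web client
    // getPage needs a full URL with its protocol; a bare "www.baidu.com" throws MalformedURLException
    HtmlPage page = webClient.getPage("http://www.baidu.com"); // fetch and parse the page
    System.out.println("Page HTML: " + page.asXml());
    System.out.println("==================================");
    System.out.println("Page text: " + page.asText());
    webClient.close(); // close the client and free memory
  }
}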
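
The three constructors listed above can be compared side by side. The following is a minimal sketch, assuming the 2.26 dependency shown earlier; the class name `WebClientConstructors` and the proxy host `proxy.example.com:8080` are placeholders invented for illustration, not values from the original article. No page is fetched, so it runs without network access.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;

public class WebClientConstructors {

    // Constructor 1: the browser version comes from BrowserVersion.getDefault().
    public static String defaultBrowserNickname() {
        try (WebClient webClient = new WebClient()) {
            return webClient.getBrowserVersion().getNickname();
        }
    }

    // Constructor 2: the caller chooses which browser to emulate.
    public static String explicitBrowserNickname() {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            return webClient.getBrowserVersion().getNickname();
        }
    }

    // Constructor 3: browser version plus a proxy server.
    // The host and port are hypothetical placeholders.
    public static String proxyHostAndPort() {
        try (WebClient webClient =
                 new WebClient(BrowserVersion.CHROME, "proxy.example.com", 8080)) {
            ProxyConfig proxy = webClient.getOptions().getProxyConfig();
            return proxy.getProxyHost() + ":" + proxy.getProxyPort();
        }
    }

    public static void main(String[] args) {
        System.out.println("default:  " + defaultBrowserNickname());
        System.out.println("explicit: " + explicitBrowserNickname());
        System.out.println("proxy:    " + proxyHostAndPort());
    }
}
```

Because `WebClient` implements `AutoCloseable`, try-with-resources can stand in for the explicit `webClient.close()` call and still releases the client if an exception is thrown.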

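Getting specific elements with HtmlUnit
The get methods of the page object (HtmlPage), listed below from the official documentation, make it easy to retrieve specific elements.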

HtmlAnchor getAnchorByHref(String href)
Returns the HtmlAnchor with the specified href.
HtmlAnchor getAnchorByName(String name)
Returns the HtmlAnchor with the specified name.
HtmlAnchor getAnchorByText(String text)
Returns the first anchor with the specified text.
List<HtmlAnchor> getAnchors()
Returns a list of all anchors contained in this page.
URL getBaseURL()
The base URL used to resolve relative URLs.
HtmlElement getBody()
Returns the body element (or frameset element), or null if it does not yet exist.
Charset getCharset()
Returns the encoding.
String getContentType()
Returns the content type of this page.
HtmlElement getDocumentElement()
Returns the document element.
String getDocumentURI()
Not yet implemented.
DOMConfiguration getDomConfig()
Not yet implemented.
DomElement getElementById(String elementId)
<E extends DomElement> E getElementByName(String name)
Returns the element with the specified name.
HtmlElement getElementFromPoint(int x, int y)
INTERNAL API - SUBJECT TO CHANGE AT ANY TIME - USE AT YOUR OWN RISK.
List<DomElement> getElementsById(String elementId)
Returns the elements with the specified ID.
List<DomElement> getElementsByIdAndOrName(String idAndOrName)
Returns the elements with the specified string for their name or ID.
List<DomElement> getElementsByName(String name)
Returns the elements with the specified name attribute.
DomElement getFocusedElement()
Returns the element with the focus or null if no element has the focus.
HtmlForm getFormByName(String name)
Returns the first form that matches the specified name.
List<HtmlForm> getForms()
Returns a list of all the forms in this page.
FrameWindow getFrameByName(String name)
Returns the first frame contained in this page with the specified name.
List<FrameWindow> getFrames()
Returns a list containing all the frames (from frame and iframe tags) in this page.
URL getFullyQualifiedUrl(String relativeUrl)
Given a relative URL (ie /foo), returns a fully-qualified URL based on the URL that was used to load this page.
HtmlElement getHead()
Returns the head element.
HtmlElement getHtmlElementByAccessKey(char accessKey)
Returns the HTML element that is assigned to the specified access key.
<E extends HtmlElement> E getHtmlElementById(String elementId)
Returns the HTML element with the specified ID.
List<HtmlElement> getHtmlElementsByAccessKey(char accessKey)
Returns all the HTML elements that are assigned to the specified access key.
DOMImplementation getImplementation()
Not yet implemented.
String getInputEncoding()
Not yet implemented.
protected List<HtmlMeta> getMetaTags(String httpEquiv)
Gets the meta tag for a given http-equiv value.
Map<String,String> getNamespaces()
Returns all namespaces defined in the root element of this page.
Document getOwnerDocument()
HtmlPage getPage()
Returns the page that contains this node.
String getResolvedTarget(String elementTarget)
Given a target attribute value, resolve the target using a base target for the page.
List<org.w3c.dom.ranges.Range> getSelectionRanges()
INTERNAL API - SUBJECT TO CHANGE AT ANY TIME - USE AT YOUR OWN RISK.
boolean getStrictErrorChecking()
Not yet implemented.
List<String> getTabbableElementIds()
Returns a list of ids (strings) that correspond to the tabbable elements in this page.
List<HtmlElement> getTabbableElements()
Returns a list of all elements that are tabbable in the order that will be used for tabbing.
String getTitleText()
Returns the title of this page or an empty string if the title wasn't specified.
String getXmlEncoding()
boolean getXmlStandalone()
String getXmlVersion()
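
A few of the get methods listed above can be tried out as follows. This is a sketch under stated assumptions rather than a definitive recipe: the class name `ElementLookupSketch`, the URL `http://example.test/`, and the HTML snippet are all made up for illustration. The page is served from memory through HtmlUnit's `MockWebConnection`, so no real site is contacted.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class ElementLookupSketch {
    // Hypothetical URL and markup, used only for this demonstration.
    private static final String PAGE_URL = "http://example.test/";
    private static final String PAGE_HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id='content'>Hello</div>"
        + "<a href='/docs'>Docs</a>"
        + "<a href='/about'>About</a>"
        + "</body></html>";

    public static List<String> lookup() {
        List<String> results = new ArrayList<>();
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Serve the canned HTML instead of making a network request.
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(new URL(PAGE_URL), PAGE_HTML);
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(PAGE_URL);

            // Page title.
            results.add(page.getTitleText());
            // Element by id; throws ElementNotFoundException if the id is absent.
            HtmlElement content = page.getHtmlElementById("content");
            results.add(content.getTextContent());
            // Anchor by its href attribute value.
            HtmlAnchor docs = page.getAnchorByHref("/docs");
            results.add(docs.getTextContent());
            // Number of anchors on the page.
            results.add(String.valueOf(page.getAnchors().size()));
            // Relative URL resolved against the page URL.
            results.add(page.getFullyQualifiedUrl("/about").toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        lookup().forEach(System.out::println);
    }
}
```

Note that `getHtmlElementById` and `getAnchorByHref` throw an exception when nothing matches, whereas `getElementById` returns null, so the choice between them depends on whether a missing element should be treated as an error.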

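Disabling CSS and JavaScript support in HtmlUnit
CSS support is rarely needed when you only want a page's content.
JavaScript is often needed, but if the pages you fetch do not depend on it, disable it as well; skipping script execution makes page loading faster.
webClient.getOptions().setCssEnabled(false); // disable CSS support
webClient.getOptions().setJavaScriptEnabled(false); // disable JavaScript support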
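
To see what the JavaScript switch changes, the sketch below loads the same page twice, once with scripts enabled and once without. The class name `OptionsSketch`, the URL, and the HTML are illustrative assumptions; the page is again served from memory with `MockWebConnection`. The inline script rewrites the page title, so the title read afterwards should reveal whether the script ran.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;

public class OptionsSketch {
    // Hypothetical URL and markup, used only for this demonstration.
    private static final String PAGE_URL = "http://example.test/";
    private static final String PAGE_HTML =
        "<html><head><title>static</title></head><body>"
        + "<script>document.title = 'scripted';</script>"
        + "</body></html>";

    // Loads the canned page and returns its title.
    public static String titleWithJavaScript(boolean javaScriptEnabled) {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setCssEnabled(false); // CSS is not needed here
            webClient.getOptions().setJavaScriptEnabled(javaScriptEnabled);

            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(new URL(PAGE_URL), PAGE_HTML);
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(PAGE_URL);
            return page.getTitleText();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("JavaScript off: " + titleWithJavaScript(false));
        System.out.println("JavaScript on:  " + titleWithJavaScript(true));
    }
}
```

If the content you need is only produced by scripts, as the title is here, JavaScript has to stay enabled; otherwise turning it off avoids the cost of executing scripts on every page load.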

Simulating a button click with HtmlUnit

HtmlPage page = webClient.getPage("url"); // fetch and parse the page
HtmlForm form = page.getFormByName("formName"); // get the search form
HtmlTextInput textField = form.getInputByName("q"); // get the query text box
HtmlSubmitInput button = form.getInputByName("submitButton"); // get the submit button
textField.setValueAttribute("java"); // "type" a value into the text box
HtmlPage resultPage = button.click(); // simulate the click and get the results page
System.out.println(resultPage.asXml());
webClient.close();
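
The click flow above can also be run end to end without a real site. The following is a minimal sketch assuming the 2.26 dependency shown earlier: the class name `FormClickSketch`, the URL `http://example.test/`, the form name `searchForm`, and the form markup are hypothetical. `MockWebConnection` serves both the form page and a stand-in results page from memory.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;

public class FormClickSketch {
    // Hypothetical URL and markup, used only for this demonstration.
    private static final String FORM_URL = "http://example.test/";
    private static final String FORM_HTML =
        "<html><head><title>Search</title></head><body>"
        + "<form name='searchForm' action='/search' method='get'>"
        + "<input type='text' name='q'>"
        + "<input type='submit' name='submitButton' value='Go'>"
        + "</form></body></html>";
    private static final String RESULT_HTML =
        "<html><head><title>Results</title></head><body></body></html>";

    // Fills in the query, clicks submit, and returns the URL of the results page.
    public static String submitSearch(String query) {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(new URL(FORM_URL), FORM_HTML);
            connection.setDefaultResponse(RESULT_HTML); // any other URL gets this page
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(FORM_URL);
            HtmlForm form = page.getFormByName("searchForm");
            HtmlTextInput textField = form.getInputByName("q");
            HtmlSubmitInput button = form.getInputByName("submitButton");

            textField.setValueAttribute(query);   // fill in the text box
            HtmlPage resultPage = button.click(); // submit the form

            return resultPage.getUrl().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(submitSearch("java"));
    }
}
```

Since the form uses the GET method, the submitted value ends up in the query string of the results page URL, which gives a quick way to check that the click really submitted the form.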

