HtmlUnit crawler optimization approaches

程序员文章站 2022-07-12 16:15:43

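Many people pick Python first for crawlers because it is simple. For a company's technology stack, however, every additional language adds considerable maintenance cost. I am most familiar with Java, so I studied HtmlUnit, Java's in-memory (headless) browser. Stock HtmlUnit turns out to have poor performance and weak multithreading support; once proxy IPs are used in particular, OOM errors are common. After monitoring the program and reading the source, I summarize the causes as follows: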

1. The JS executor runs JavaScript on a separate thread. When that thread is shut down, the class com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor runs the following code.

Quote

private void killThread() {
        if (eventLoopThread_ == null) {
            return;
        }
        try {
            eventLoopThread_.interrupt();
            eventLoopThread_.join(10_000);
        }
        catch (final InterruptedException e) {
            LOG.warn("InterruptedException while waiting for the eventLoop thread to join ", e);
            // ignore, this doesn't matter, we want to stop it
        }
        if (eventLoopThread_.isAlive()) {
            if (LOG.isWarnEnabled()) {
                LOG.warn("Event loop thread "
                        + eventLoopThread_.getName()
                        + " still alive at "
                        + System.currentTimeMillis());
                LOG.warn("Event loop thread will be stopped");
            }

            // Stop the thread
            eventLoopThread_.stop();
        }
    }

The problem in the code above:

Quote

eventLoopThread_.interrupt();
eventLoopThread_.join(10_000);

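Do not assume that interrupt() actually stops the thread. It is only a cooperative request: a thread blocked in classic blocking I/O does not react to it, and a sleeping thread merely receives an InterruptedException, so code that catches the exception and carries on keeps running. In those cases the main thread waits in join() for up to 10 seconds for eventLoopThread_ before it continues.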
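The behaviour described above can be reproduced without HtmlUnit. The following is a minimal sketch, not HtmlUnit code: the class `InterruptJoinDemo`, its method, and the "stubborn" worker are hypothetical names made up for illustration. The worker swallows interrupts, so `join(timeout)` returns only because the timeout expires, and the thread stops only once it is asked through a flag it actually checks.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptJoinDemo {

    /**
     * Starts a worker that swallows interrupts, interrupts it, and waits up to
     * joinMillis for it. Returns true if the worker is still alive after the wait.
     */
    public static boolean aliveAfterInterrupt(long joinMillis) {
        final AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!stop.get()) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    // Swallowed on purpose: the interrupt status is cleared
                    // and the loop simply goes on.
                }
            }
        }, "stubborn-worker");
        worker.setDaemon(true);
        worker.start();

        try {
            worker.interrupt();
            worker.join(joinMillis);      // returns when the timeout expires
            boolean alive = worker.isAlive();

            stop.set(true);               // cooperative shutdown via a flag the worker checks
            worker.join(1000);
            return alive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo thread was interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("alive after interrupt + join: " + aliveAfterInterrupt(200));
    }
}
```

The point of the sketch is that the caller pays the full join timeout whenever the target thread does not cooperate, which is the 10-second stall seen in killThread().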

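2. When WebClient instances are created from an external thread pool to fetch pages, DefaultJavaScriptExecutor threads are frequently left running. The problematic code is:

Quote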

public void run() {
        final boolean trace = LOG.isTraceEnabled();
        // this has to be a multiple of 10ms
        // otherwise the VM has to fight with the OS to get such small periods
        final long sleepInterval = 10;
        while (!shutdown_.get() && !Thread.currentThread().isInterrupted() && webClient_.get() != null) {
            final JavaScriptJobManager jobManager = getJobManagerWithEarliestJob();

            if (jobManager != null) {
                final JavaScriptJob earliestJob = jobManager.getEarliestJob();
                if (earliestJob != null) {
                    final long waitTime = earliestJob.getTargetExecutionTime() - System.currentTimeMillis();

                    // do we have to execute the earliest job
                    if (waitTime < 1) {
                        // execute the earliest job
                        if (trace) {
                            LOG.trace("started executing job at " + System.currentTimeMillis());
                        }
                        jobManager.runSingleJob(earliestJob);
                        if (trace) {
                            LOG.trace("stopped executing job at " + System.currentTimeMillis());
                        }

                        // job is done, have a look for another one
                        continue;
                    }
                }
            }

            // check for cancel
            if (shutdown_.get() || Thread.currentThread().isInterrupted() || webClient_.get() == null) {
                break;
            }

            // nothing to do, let's sleep a bit
            try {
                Thread.sleep(sleepInterval);
            }
            catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

The problematic line here:

Quote

while (!shutdown_.get() && !Thread.currentThread().isInterrupted() && webClient_.get() != null)

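webClient.close() is called from inside the external (pool) thread. If that external thread is itself shut down before close() is reached, the situation is like an OutputStream whose out.close() was not placed in a finally block: the JS executor thread is never shut down.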
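One way to make sure the close call is always reached is to scope the client with try-with-resources inside the pool task. The sketch below illustrates the pattern only: `FakeClient` is a hypothetical stand-in for a client that owns a background event-loop thread, and none of the names come from HtmlUnit.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ScopedClientDemo {

    /** Hypothetical stand-in for a client that owns a background event-loop thread. */
    static final class FakeClient implements AutoCloseable {
        private final AtomicBoolean shutdown = new AtomicBoolean(false);
        private final Thread eventLoop;

        FakeClient() {
            eventLoop = new Thread(() -> {
                while (!shutdown.get()) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        return; // treat an interrupt as a request to stop
                    }
                }
            }, "fake-js-event-loop");
            eventLoop.start();
        }

        String fetch(String url) {
            if (url.isEmpty()) {
                throw new IllegalArgumentException("empty url");
            }
            return "content of " + url;
        }

        boolean eventLoopAlive() {
            return eventLoop.isAlive();
        }

        @Override
        public void close() {
            shutdown.set(true);
            eventLoop.interrupt();
            try {
                eventLoop.join(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs one fetch on a pool thread; returns true if the event loop was stopped afterwards. */
    public static boolean eventLoopStoppedAfter(String url) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        final FakeClient[] holder = new FakeClient[1];
        try {
            Future<?> task = pool.submit(() -> {
                // try-with-resources runs close() even when fetch() throws
                try (FakeClient client = new FakeClient()) {
                    holder[0] = client;
                    client.fetch(url);
                }
            });
            try {
                task.get(5, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                // fetch() failed; close() has still been executed
            }
            return !holder[0].eventLoopAlive();
        } catch (InterruptedException | TimeoutException e) {
            throw new IllegalStateException("task did not finish", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(eventLoopStoppedAfter("http://example.com/"));
        System.out.println(eventLoopStoppedAfter(""));
    }
}
```

Applied to a real crawler, the same shape would wrap the WebClient in the pool task, so that the executor's shutdown path runs whether the fetch succeeds or fails.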

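3. The most important cause of HtmlUnit's poor performance is that every fetch of the same page downloads the same resources (js, css, jpg) again. The download code lives in the class com.gargoylesoftware.htmlunit.HttpWebConnection.

The main method is:

Quote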

/**
     * Reads the content of the stream and saves it in memory or on the file system.
     * @param is the stream to read
     * @param maxInMemory the maximumBytes to store in memory, after which save to a local file
     * @return a wrapper around the downloaded content
     * @throws IOException in case of read issues
     */
    public static DownloadedContent downloadContent(final InputStream is, final int maxInMemory) throws IOException {
        if (is == null) {
            return new DownloadedContent.InMemory(null);
        }

        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            final byte[] buffer = new byte[1024];
            int nbRead;
            try {
                while ((nbRead = is.read(buffer)) != -1) {
                    bos.write(buffer, 0, nbRead);
                    if (bos.size() > maxInMemory) {
                        // we have exceeded the max for memory, let's write everything to a temporary file
                        final File file = File.createTempFile("htmlunit", ".tmp");
                        file.deleteOnExit();
                        try (OutputStream fos = Files.newOutputStream(file.toPath())) {
                            bos.writeTo(fos); // what we have already read
                            IOUtils.copyLarge(is, fos); // what remains from the server response
                        }
                        return new DownloadedContent.OnFile(file, true);
                    }
                }
            }
            catch (final ConnectionClosedException e) {
                LOG.warn("Connection was closed while reading from stream.", e);
                return new DownloadedContent.InMemory(bos.toByteArray());
            }
            catch (final EOFException e) {
                // this might happen with broken gzip content
                LOG.warn("EOFException while reading from stream.", e);
                return new DownloadedContent.InMemory(bos.toByteArray());
            }

            return new DownloadedContent.InMemory(bos.toByteArray());
        }
    }

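Modify this code so that repeatedly downloaded content is cached. This greatly improves crawl performance and also makes it possible to modify a page's JS dynamically.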
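The caching idea can be sketched independently of HtmlUnit's internals. The class below is an assumption made for illustration, not the author's patch: `ResourceCache`, its `get`/`put` methods, and the downloader function are hypothetical names. It keeps a bounded, least-recently-used map from URL to bytes and only calls the downloader on a miss.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ResourceCache {
    private final int maxEntries;
    private final Map<String, byte[]> entries;
    private int downloads;

    public ResourceCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true turns the LinkedHashMap into an LRU structure
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > ResourceCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached bytes for url, calling the downloader only on a cache miss. */
    public synchronized byte[] get(String url, Function<String, byte[]> downloader) {
        byte[] cached = entries.get(url);
        if (cached == null) {
            // Note: the lock is held while downloading, which serializes downloads.
            cached = downloader.apply(url);
            downloads++;
            entries.put(url, cached);
        }
        return cached;
    }

    /** Pre-populates or overrides an entry, for example with a modified script. */
    public synchronized void put(String url, byte[] content) {
        entries.put(url, content);
    }

    /** Number of times the downloader was actually invoked. */
    public synchronized int downloadCount() {
        return downloads;
    }
}
```

A real integration would also have to decide which responses are safe to reuse (static js, css, images) and when entries expire; the sketch leaves that out. The `put` method hints at how a modified script could be served in place of the original.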

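4. When a request timeout is set in HtmlUnit, the connection timeout and the socket timeout cannot be set separately. For example, with requestTimeout=10000 the underlying HttpClient gets connectTimeout=10000 and socketTimeout=10000. This is a problem: in my view the connection timeout should normally be much lower, ideally under 100 milliseconds. The code that sets these values is in com.gargoylesoftware.htmlunit.HttpWebConnection, in this method:

Quote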

private static RequestConfig.Builder createRequestConfigBuilder(final int timeout, final InetAddress localAddress) {
        final RequestConfig.Builder requestBuilder = RequestConfig.custom()
                .setCookieSpec(HACKED_COOKIE_POLICY)
                .setRedirectsEnabled(false)
                .setLocalAddress(localAddress)

                // timeout
                .setConnectTimeout(timeout)
                .setConnectionRequestTimeout(timeout)
                .setSocketTimeout(timeout);
        return requestBuilder;
    }
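To show what separate timeouts look like, here is a dependency-free sketch that uses the JDK's HttpURLConnection instead of the Apache HttpClient RequestConfig quoted above. The class name `SplitTimeouts` and the method `open` are hypothetical; the connect timeout and the read (socket) timeout are passed as two independent parameters rather than one shared value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class SplitTimeouts {

    /**
     * Prepares (but does not open) a connection with independent timeouts.
     * connectTimeoutMs bounds the TCP connect; readTimeoutMs bounds each blocking read.
     */
    public static HttpURLConnection open(String url, int connectTimeoutMs, int readTimeoutMs) {
        try {
            // openConnection() only creates the object; no network I/O happens yet
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = open("http://example.com/", 500, 10_000);
        System.out.println("connect=" + conn.getConnectTimeout() + " read=" + conn.getReadTimeout());
    }
}
```

In the quoted HtmlUnit method the equivalent change would be to pass different values to setConnectTimeout and setSocketTimeout; the concrete numbers used here are placeholders, not recommendations.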

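In summary, once the points above are addressed, HtmlUnit performs no worse than a Python crawler, which is limited to multiprocessing, and it can also be used for black-hat work.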
