The gethostbyname blocking problem in multithreaded environments

程序员文章站 2022-05-02 12:52:49


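On Unix/Linux, gethostbyname is commonly used to look up the IP address of a domain name via DNS. Because DNS resolution is recursive, a call to gethostbyname can take a very long time to return. Unlike connect or read, it offers no way to bound the wait: there is no socket on which to set a timeout with setsockopt or to watch with select, so the call often becomes the bottleneck of a program. One suggested workaround is to arm a timer signal with alarm and, if it fires, use setjmp and longjmp to jump out of gethostbyname (I have not tried this, so I cannot say how well it works).

In a multithreaded program gethostbyname has an even more serious problem: if the call blocks in one thread, the other threads all block at gethostbyname as well. I ran into this while writing a crawler, and it puzzled me for a long time: every crawler thread was stuck in gethostbyname, which made the crawler extremely slow. I searched Google for a long time without finding an answer. Today I happened to come across an e-book in our lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", whose section on crawlers contains the following passages: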

Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API - application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.
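
The prefetching step described in the passage can be sketched roughly as follows. This is a minimal illustration in Python, assuming a caching resolver that accepts plain UDP queries on port 53. `build_dns_query` and `prefetch` are hypothetical helper names, and the packet layout follows the standard DNS wire format (RFC 1035). Replies are deliberately never read, because the only goal is to warm the resolver's cache.

```python
import socket
import struct


def build_dns_query(hostname, query_id):
    """Encode a recursive A-record query in DNS wire format (ASCII names only)."""
    # Header: ID, flags (only RD "recursion desired" set), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in hostname.rstrip(".").encode("ascii").split(b"."):
        if not 0 < len(label) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        qname += bytes([len(label)]) + label  # length-prefixed label
    qname += b"\x00"  # the root label terminates the name
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def prefetch(hostnames, dns_server, port=53):
    """Send one UDP query per hostname and return without waiting for any reply.

    Returns the number of queries sent.
    """
    sent = 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for i, name in enumerate(hostnames):
            # The ID only has to fit in 16 bits here, since replies are ignored.
            sock.sendto(build_dns_query(name, i & 0xFFFF), (dns_server, port))
            sent += 1
    finally:
        sock.close()
    return sent


# Hypothetical usage, with hostnames taken from the HREFs of a freshly parsed page
# and 192.0.2.53 standing in for the address of the local caching server:
#   prefetch(["www.example.com", "img.example.org"], "192.0.2.53")
```

Because nothing is read back, a lost datagram simply means that hostname is resolved the slow way later on, which is acceptable for a cache-warming step.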

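The gist is that the gethostbyname provided by Unix cannot cope with concurrent use; this is an inherent limitation of the function and cannot be changed. Large crawlers therefore tend not to use gethostbyname at all and implement their own custom DNS client instead. Doing so makes it possible to balance load across DNS servers, and resolving asynchronously speeds up DNS resolution considerably. Such a client is usually implemented over UDP, and it can resolve the IP address for a URL ahead of time, before the crawler fetches the page. The passage also mentions adns, an open-source asynchronous DNS library, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/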
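
The quoted passage asks two things of a custom client: many outstanding requests that are polled later, and load spread over several DNS servers. Both can be illustrated with a single UDP socket polled with select. This is a simplified Python sketch under stated assumptions, not how adns or Mercator is implemented. `encode_query`, `read_header`, and `resolve_many` are hypothetical names, servers are picked round-robin, and only the reply header is inspected instead of decoding the answer records.

```python
import select
import socket
import struct
import time


def encode_query(hostname, query_id):
    """Encode a recursive A-record query in DNS wire format (ASCII names only)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def read_header(reply):
    """Return (query_id, rcode, answer_count) from the 12-byte DNS header."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return query_id, flags & 0x000F, ancount


def resolve_many(hostnames, servers, timeout=2.0, port=53):
    """Send every query up front, then poll one socket until the deadline.

    Returns {hostname: (rcode, answer_count)} for the hosts that answered in time.
    Assumes at most 65536 hostnames per call, because the ID field is 16 bits.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pending, results = {}, {}
    try:
        for query_id, name in enumerate(hostnames):
            server = servers[query_id % len(servers)]  # round-robin load spreading
            sock.sendto(encode_query(name, query_id), (server, port))
            pending[query_id] = name
        deadline = time.monotonic() + timeout
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            readable, _, _ = select.select([sock], [], [], remaining)
            if not readable:
                break  # deadline reached with requests still outstanding
            reply, _addr = sock.recvfrom(4096)
            if len(reply) < 12:
                continue  # too short to hold a DNS header
            query_id, rcode, ancount = read_header(reply)
            name = pending.pop(query_id, None)  # match the reply to its request by ID
            if name is not None:
                results[name] = (rcode, ancount)
        return results
    finally:
        sock.close()
```

A real client would also retransmit lost queries, check that each reply comes from the server it was sent to, and parse the answer section. The point here is only that no request waits for another one to finish.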
As the above shows, gethostbyname is not suitable for multithreaded programs, or for any other program that needs fast DNS resolution.
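
For code that has to stay with the system resolver, the missing timeout can at least be worked around on the caller's side. The sketch below is in Python rather than C, and it is not the alarm/longjmp approach mentioned earlier: it hands the blocking call to a worker thread and stops waiting after a deadline. `resolve_with_timeout`, `slow_fake_resolver`, and the pool size of 8 are names and values made up for this illustration.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout

# A shared pool, so that a stuck lookup ties up one worker rather than the caller.
_LOOKUP_POOL = ThreadPoolExecutor(max_workers=8)


def resolve_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Return resolver(host), or None if no answer arrives within `timeout` seconds.

    Resolution errors raised by the resolver (e.g. socket.gaierror) propagate.
    """
    future = _LOOKUP_POOL.submit(resolver, host)
    try:
        return future.result(timeout=timeout)
    except LookupTimeout:
        # The caller moves on; the worker stays blocked until the lookup returns.
        return None


def slow_fake_resolver(host):
    """Hypothetical stand-in for a lookup that hangs, so the demo needs no network."""
    time.sleep(0.5)
    return "192.0.2.1"  # address from the documentation range (RFC 5737)


if __name__ == "__main__":
    print(resolve_with_timeout("slow.example", 0.05, resolver=slow_fake_resolver))
    print(resolve_with_timeout("fast.example", 1.0, resolver=lambda h: "192.0.2.7"))
```

This bounds only the caller's wait. The abandoned lookup still occupies a worker until it returns, so a burst of slow hostnames can exhaust the pool. That is one more reason the asynchronous clients discussed above scale better.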

Reposted from: https://my.oschina.net/qeecoo/blog/368671
