Multithreaded Crawlers

程序员文章站 2022-03-09 09:26:06

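1. Introduction

The crawlers written so far have all been single-threaded: once one request stalls, the whole program just sits there waiting. Multithreading or multiprocessing can be used to work around this.

Personally I do not recommend it, but here is a brief introduction anyway.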

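2. Usage

The crawler uses multiple threads to handle network requests. Request threads take URLs from a URL queue and put the result returned for each URL into a second queue. Other threads then read the data from that second queue and write it to a file.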
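The flow described above can be sketched without any network access. This is only a minimal illustration, not the article's code: `fake_fetch`, `run_pipeline`, and the example.com URLs are hypothetical stand-ins, the download is simulated with a short sleep, and a list takes the place of the output file.

```python
import threading
import time
from queue import Queue, Empty


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    time.sleep(0.01)  # pretend to wait on the network
    return "<html>" + url + "</html>"


def run_pipeline(urls, n_fetchers=3):
    url_queue = Queue()   # URLs waiting to be fetched
    html_queue = Queue()  # fetched results waiting to be saved
    for url in urls:
        url_queue.put(url)

    def fetcher():
        # Take URLs until the URL queue is drained.
        while True:
            try:
                url = url_queue.get_nowait()
            except Empty:
                return
            html_queue.put(fake_fetch(url))

    saved = []  # stands in for the output file

    def writer():
        # Started only after all fetchers have finished, so draining is safe.
        while True:
            try:
                saved.append(html_queue.get_nowait())
            except Empty:
                return

    fetchers = [threading.Thread(target=fetcher) for _ in range(n_fetchers)]
    for t in fetchers:
        t.start()
    for t in fetchers:
        t.join()

    w = threading.Thread(target=writer)
    w.start()
    w.join()
    return saved
```

Calling `run_pipeline(["http://example.com/page/1", "http://example.com/page/2"])` returns one simulated page per URL; the order may vary because the fetchers run concurrently.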

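3. Main components

3.1 URL queue and result queue

The URLs to crawl are placed in a queue, here using Queue from the standard library. The result of fetching each URL is stored in a result queue.

Initialize the URL queue and the result queue:

from queue import Queue

url_queue = Queue()   # URLs waiting to be crawled
html_queue = Queue()  # page HTML waiting to be processed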
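As a quick illustration of how `Queue` behaves, the sketch below fills a queue and drains it again. The `drain` helper, `demo_queue`, and the example.com URLs are placeholders introduced here. Items come out in the order they went in, and `get_nowait()` raises `queue.Empty` instead of blocking once nothing is left.

```python
from queue import Queue, Empty


def drain(q):
    """Remove and return every item currently in the queue, in FIFO order."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except Empty:
            return items


demo_queue = Queue()
for page in range(1, 4):
    demo_queue.put("https://example.com/page/{}/".format(page))

size_before = demo_queue.qsize()  # number of items currently queued
drained = drain(demo_queue)       # page 1 comes out first
```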

3.2 Request threads

Use several threads that keep taking URLs from the URL queue and processing them:

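from queue import Empty
from random import choice
from threading import Thread

import requests


class ThreadInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        # Header values must not repeat the "User-Agent:" field name.
        user_agents = [
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        ]
        headers = {"User-Agent": choice(user_agents)}
        while True:
            try:
                # get_nowait() avoids the race where another thread takes the
                # last URL between an empty() check and a blocking get().
                url = self.url_queue.get_nowait()
            except Empty:
                break
            try:
                # The 10-second timeout is an arbitrary choice so that one
                # stalled request cannot hang the thread forever.
                response = requests.get(url, headers=headers, timeout=10)
            except requests.RequestException:
                continue  # skip URLs that fail or time out
            if response.status_code == 200:
                self.html_queue.put(response.text)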

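If the queue is empty, a plain get() call blocks the thread until an item becomes available. After a thread has finished processing an item taken from the queue, it should notify the queue that this item is done by calling task_done().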
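The blocking and "notify the queue" behaviour described here corresponds to `Queue.get()`, `Queue.task_done()`, and `Queue.join()` in the standard library. Below is a minimal sketch of that pattern. The `process_all` function, the `None` sentinel, and the doubling step are hypothetical placeholders for real work.

```python
import threading
from queue import Queue


def process_all(items, n_workers=3):
    tasks = Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()      # blocks while the queue is empty
            try:
                if item is None:    # sentinel: no more work for this worker
                    return
                with results_lock:
                    results.append(item * 2)
            finally:
                tasks.task_done()   # tell the queue this item is finished

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(None)             # one sentinel per worker
    tasks.join()                    # returns once every put() has a matching task_done()
    for w in workers:
        w.join()
    return sorted(results)
```

Because the workers block in `get()` rather than polling `empty()`, they can be started before any work has been queued.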

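3.3 Processing threads

These threads process the data in the result queue and save it to a file. If more than one thread writes to the file, the writes must be protected by a lock.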

import codecs
import threading
lock = threading.Lock()  # shared by every thread that writes to the file
f = codecs.open('xiaohua.txt', 'w', 'utf-8')

When a thread needs to write to the file, it can do so like this:

with lock:
    f.write(something)
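To make the locking pattern concrete, here is a small self-contained sketch in which several threads write to one shared file object. The names `write_lines` and `demo`, the thread counts, and the temporary file are all assumptions made for illustration.

```python
import os
import tempfile
import threading


def write_lines(path, n_threads=4, lines_per_thread=50):
    lock = threading.Lock()
    with open(path, "w", encoding="utf-8") as f:

        def writer(thread_id):
            for i in range(lines_per_thread):
                with lock:  # only one thread writes at a time
                    f.write("thread {} line {}\n".format(thread_id, i))

        threads = [threading.Thread(target=writer, args=(t,))
                   for t in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # the file is closed only after every writer is done


def demo():
    # Write to a throwaway file and read the lines back.
    path = os.path.join(tempfile.mkdtemp(), "xiaohua_demo.txt")
    write_lines(path)
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()
```

Each call to `f.write` happens while the lock is held, so every line lands in the file whole, although lines from different threads may alternate.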

4. A small example

Here is an example that crawls jokes from Qiushibaike (糗事百科):

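from queue import Empty, Queue
from random import choice
from threading import Lock, Thread

import requests
from lxml import etree

# Shared by all parser threads so that writes to the output file do not interleave
file_lock = Lock()


# Crawler class: downloads pages
class CrawlInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        # Header values must not repeat the "User-Agent:" field name.
        user_agents = [
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        ]
        headers = {"User-Agent": choice(user_agents)}
        while True:
            try:
                # get_nowait() instead of empty() + get(): another thread may
                # take the last URL between the check and a blocking get().
                url = self.url_queue.get_nowait()
            except Empty:  # the URL queue has been drained
                break
            try:
                # The 10-second timeout is an arbitrary choice so that one
                # stalled request cannot hang the thread forever.
                response = requests.get(url, headers=headers, timeout=10)
            except requests.RequestException:
                continue  # skip URLs that fail or time out
            if response.status_code == 200:
                self.html_queue.put(response.text)


# Parser class: extracts the jokes and saves them
class ParseInfo(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            try:
                html = self.html_queue.get_nowait()
            except Empty:
                break
            e = etree.HTML(html)
            span_list = e.xpath('//div[@class="content"]/span[1]')
            # Several parser threads append to the same file, so hold the lock
            with file_lock:
                with open('xiaohua.txt', 'a', encoding='utf-8') as f:
                    for span in span_list:
                        info = span.xpath('string(.)')
                        f.write(info + '\n')


if __name__ == '__main__':
    url_queue = Queue()  # container for the URLs to crawl
    base_url = "https://www.qiushibaike.com/text/page/{}/"
    html_queue = Queue()  # container for the full HTML of each fetched page, not yet parsed
    for i in range(1, 14):
        new_url = base_url.format(i)
        print(new_url)
        url_queue.put(new_url)

    Crawl_list = []  # holds the crawler threads so that all 3 can be joined below
    for i in range(0, 3):  # create 3 threads
        Crawl1 = CrawlInfo(url_queue, html_queue)
        Crawl_list.append(Crawl1)
        Crawl1.start()

    for crawl in Crawl_list:
        crawl.join()

    # Parsing starts only after every crawler thread has finished
    parse_list = []
    for i in range(0, 3):
        parse = ParseInfo(html_queue)
        parse_list.append(parse)
        parse.start()
    for parse in parse_list:
        parse.join()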
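In the example above the parser threads start only after every crawler thread has been joined, so downloading and parsing never overlap. One possible variation, sketched here under stated assumptions, lets the parser run while pages are still arriving and uses a sentinel object to signal the end of the stream. `STOP`, `fake_download`, and `crawl_and_parse` are hypothetical names, the download is simulated with a sleep, and a simple string replacement stands in for the lxml parsing.

```python
import threading
import time
from queue import Queue

STOP = object()  # sentinel marking "no more items"


def fake_download(url):
    """Hypothetical stand-in for requests.get(url).text."""
    time.sleep(0.01)
    return "<span>joke from {}</span>".format(url)


def crawl_and_parse(urls, n_crawlers=3):
    url_queue, html_queue = Queue(), Queue()
    for url in urls:
        url_queue.put(url)
    for _ in range(n_crawlers):
        url_queue.put(STOP)          # one stop marker per crawler

    def crawler():
        while True:
            url = url_queue.get()
            if url is STOP:
                return
            html_queue.put(fake_download(url))

    parsed = []

    def parser():
        while True:
            html = html_queue.get()  # blocks until a page arrives
            if html is STOP:
                return
            parsed.append(html.replace("<span>", "").replace("</span>", ""))

    parser_thread = threading.Thread(target=parser)
    parser_thread.start()            # parser is running before any page exists
    crawlers = [threading.Thread(target=crawler) for _ in range(n_crawlers)]
    for t in crawlers:
        t.start()
    for t in crawlers:
        t.join()
    html_queue.put(STOP)             # all pages are queued; let the parser finish
    parser_thread.join()
    return parsed
```

With a single parser thread no file lock is needed; with several parsers, one `STOP` per parser and a lock around the shared output would be required.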
