A Web Crawler in Python 2

程序员文章站 (Programmer Article Site) 2022-05-04 11:28:54

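I was bored and decided to scrape the images from an adult site. Once the code was written and I had the image URLs, though, the images would not download: every request came back with a 403. What a headache. So my little plan fell apart. I'll share the code anyway. In practice it only prints the image URLs, which you have to open one by one; it cannot download them automatically. Everything the urlretrieve function fetched was a 4 KB image file that would not open.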
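A guess at what is going on, not something verified against this site: a 403 on image URLs that open fine in a browser is often hotlink protection, where the image host checks the Referer header. Python 2's `urllib.urlretrieve` sends no custom headers and does not raise on HTTP error statuses, so the 4 KB files may simply be the server's error page saved under a `.jpg` name. The sketch below sends the image request with the same User-Agent plus a Referer pointing at the page that embeds the image. The helper names `build_image_request` and `download_image` are made up for illustration, and whether a Referer is enough for this particular server is an assumption.

```python
# Runs on Python 2 and 3; only the import location differs.
try:
    from urllib2 import Request, urlopen          # Python 2
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) '
              'Gecko/20100101 Firefox/53.0')


def build_image_request(img_url, page_url):
    """Build a request that looks like it came from the embedding page."""
    return Request(img_url, headers={
        'User-Agent': USER_AGENT,
        # Assumption: the image host rejects requests without a Referer.
        'Referer': page_url,
    })


def download_image(img_url, page_url, dest_path):
    """Fetch img_url and write its bytes to dest_path; return the byte count.

    urlopen raises HTTPError on a 403, so a failed download is reported
    instead of silently saving the error page to disk.
    """
    data = urlopen(build_image_request(img_url, page_url), timeout=10).read()
    with open(dest_path, 'wb') as f:
        f.write(data)
    return len(data)
```

A call would look like `download_image(item, url1, 'pic_0001.jpg')`, where `item` is an image URL and `url1` is the article page it was found on.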

# coding=utf-8
# Python 2 script: walks the list pages and prints every .jpg URL it finds.
import re
import urllib2

from bs4 import BeautifulSoup

head = 'http://www.1111kf.com%s'
urlFloor0 = 'http://www.1111kf.com/artlist/25-%d.html'
UserAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'
headers = {'User-Agent': UserAgent}
# Captures the first http...jpg URL that follows each "img ... src ="
pattern = re.compile('img.*?src.*?=.*?(http.*?jpg)')
i = 2  # number of the first list page to fetch

while True:
    raw = raw_input('Press Enter to fetch the next page (any other input quits): ')
    if raw != '':
        break
    url = urlFloor0 % i
    i += 1
    req = urllib2.Request(url=url, headers=headers)
    content = urllib2.urlopen(req).read()
    BS = BeautifulSoup(content, 'html.parser')
    # Collect this page's article links. The list is rebuilt for every page
    # so that articles from earlier pages are not fetched again.
    urlFloor1 = []
    for link in BS.find_all(target="_blank", href=True):
        comment_id = str(link.get('href'))
        # Keep site-relative links only; absolute http... links are skipped.
        if not comment_id.startswith('http'):
            urlFloor1.append(comment_id)
    for m in urlFloor1:
        url1 = head % m
        req = urllib2.Request(url=url1, headers=headers)
        content1 = urllib2.urlopen(req).read()
        # Print each image URL found on the article page.
        for item in re.findall(pattern, content1):
            print item
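The two parsing steps in the script (turning a site-relative link into a full URL, and pulling `.jpg` URLs out of the page source) can be tried offline on a small piece of HTML. The following is a minimal sketch, not the script's own code: `SAMPLE`, `extract_jpg_urls` and `absolutize` are hypothetical names, the `example.com` addresses are placeholders, and the regular expression is a somewhat stricter variant of the one above (it stays inside a single `<img>` tag and escapes the dot before `jpg`).

```python
import re

try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

# Capture the src value of an <img> tag when it is an absolute .jpg URL.
IMG_SRC = re.compile(r'''<img[^>]*?\bsrc\s*=\s*["']?(http[^"'\s>]*?\.jpg)''',
                     re.I)


def extract_jpg_urls(html):
    """Return every absolute .jpg URL used as an <img> src, in page order."""
    return IMG_SRC.findall(html)


def absolutize(base, href):
    """Resolve href against base; absolute hrefs come back unchanged."""
    return urljoin(base, href)


# Hypothetical page fragment, for illustration only.
SAMPLE = '''
<a href="/art/1.html" target="_blank">one</a>
<img src="http://img.example.com/a.jpg" alt="a">
<img class="x" src='http://img.example.com/b.JPG'>
<img src="/static/logo.png">
'''

if __name__ == '__main__':
    for jpg in extract_jpg_urls(SAMPLE):
        print(jpg)
    print(absolutize('http://www.example.com/artlist/25-2.html', '/art/1.html'))
```

Using `urljoin` instead of string formatting means absolute and relative links can go through the same code path, so no prefix check is needed before building the request URL.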
