
A Python 2 Web Crawler


Out of sheer boredom I set out to scrape images from an adult site. The code got as far as extracting the image URLs, but the downloads themselves failed: every request came back 403, so my little plan fell apart. Sharing the code anyway. In practice you can only open the URLs one by one rather than auto-download them; urlretrieve just pulls down 4 KB files that won't open (see the sketch after the listing for a header trick that might get past the 403).

# coding=utf-8
import re
import urllib2
from bs4 import BeautifulSoup

head = 'http://www.1111kf.com%s'
urlFloor0 = 'http://www.1111kf.com/artlist/25-%d.html'
i = 2
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) '
                   'Gecko/20100101 Firefox/53.0')
}

while True:
    # Press Enter to crawl the next list page; any other input quits.
    if raw_input('Press Enter for the next page, anything else to quit: ') != '':
        break
    url = urlFloor0 % i
    i += 1
    req = urllib2.Request(url=url, headers=headers)
    content = urllib2.urlopen(req).read()
    soup = BeautifulSoup(content, 'html.parser')

    # Collect the relative article links on this list page; hrefs that
    # already start with 'http' point off-site, so skip them.
    urlFloor1 = []
    for link in soup.find_all(target='_blank'):
        href = str(link.get('href'))
        if not href.startswith('http'):
            urlFloor1.append(href)

    # Open every article page and pull the .jpg URLs out of the raw HTML.
    pattern = re.compile(r'img.*?src.*?=.*?(http.*?jpg)')
    for m in urlFloor1:
        url1 = head % m
        req = urllib2.Request(url=url1, headers=headers)
        content1 = urllib2.urlopen(req).read()
        for item in pattern.findall(content1):
            print item
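
About that 403: urllib.urlretrieve sends no User-Agent or Referer, so the image host most likely serves its error page instead of the picture, which is exactly what a 4 KB file that won't open looks like. Below is a minimal, untested sketch that fetches the image through urllib2 with the same headers plus a Referer. save_image is a hypothetical helper of mine, not part of the script above, and the hotlink check being the culprit is an assumption.

# A sketch only: reuses `headers` and `urllib2` from the script above.
# save_image() is a hypothetical helper; that the server rejects requests
# without a Referer is an assumption about why urlretrieve got 403s.
def save_image(pic_url, referer, filename):
    req = urllib2.Request(url=pic_url, headers={
        'User-Agent': headers['User-Agent'],
        'Referer': referer,  # many image hosts refuse requests without this
    })
    data = urllib2.urlopen(req).read()
    with open(filename, 'wb') as f:
        f.write(data)

# Inside the inner loop, instead of `print item`:
#     save_image(item, url1, item.split('/')[-1])

If the headers alone don't do it, the host may also be checking cookies, in which case an opener built with urllib2.HTTPCookieProcessor would be the next thing to try.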