Learning Python 3 Web Crawling Systematically, Lecture 1: The Basic Library urllib

程序员文章站 (Programmer Article Station) 2022-05-21 17:36:07

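In Python 3, the basic libraries most commonly used for web crawling are urllib and requests.

This article covers urllib.

urllib contains four modules:

request: simulates sending requests

error: exception handling

parse: utilities for working with URLs

robotparser: reads a site's robots.txt to determine which content may be crawled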

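I. Sending Requests

To send requests with urllib, two things from the request module do most of the work: the urlopen() function and the Request class. A Request object is used together with urlopen().

First, the API of urlopen():

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, context=None)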

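Parameters:

url: required. The URL to open, given either as a string or as a Request object.

data: needed for POST requests. It must be converted to bytes before being passed. When data is supplied, the request is sent as POST instead of GET.

timeout: the timeout in seconds, given as a single number such as timeout=5. If the request takes longer than this, a timeout exception is raised. The tuple form timeout=(5, 30), meaning separate connect and read timeouts, belongs to the requests library and is not accepted by urllib.

context: must be an ssl.SSLContext object, which specifies the SSL settings.

cafile, capath: the CA certificate file and the directory of CA certificates, used for HTTPS requests. Both have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.

Notes: bytes(string, encoding) converts a string to bytes. The first argument is the string and the second is the encoding.

urllib.parse.urlencode(dict) converts a dict to a query string.

Example:

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# urlencode() builds the form string; bytes() encodes it for the request body
data = bytes(urllib.parse.urlencode({'name': 'value'}), encoding='utf8')
# urllib accepts only a single number of seconds here
timeout = 10

response = urllib.request.urlopen(url, data=data, timeout=timeout)

# Print the type of response
print(type(response))
# Print the page content
print(response.read().decode('utf8'))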
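The conversion of a dict into the bytes that data expects can be checked without sending anything. The following is a minimal sketch; the field names are made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical form fields, used only for illustration
form = {'name': 'value', 'page': 2}

# Step 1: dict -> URL-encoded string
encoded = urlencode(form)
# Step 2: string -> bytes, the type urlopen() expects for data
data = bytes(encoded, encoding='utf8')   # same result as encoded.encode('utf8')

print(encoded)      # name=value&page=2
print(type(data))   # <class 'bytes'>
```

Non-string values such as the integer 2 are converted to text by urlencode(), so the receiving side always sees strings.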

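type(response) shows that urlopen() returns an HTTPResponse object. Its main methods and attributes are:

read(): returns the page content

getheaders(): returns the response headers

getheader(name): returns the value of the response header named name

msg, version, status (the status code), reason, debuglevel, closed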
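To try these methods and attributes without depending on an external site, one option is to start a throwaway server on the local machine. The sketch below does that with the standard library's http.server. The HelloHandler class and its response body are invented for this illustration. opener.open() is used in place of urlopen() only so that a system-wide proxy cannot intercept the local request; it returns the same kind of response object.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local handler so the example needs no external site."""

    def do_GET(self):
        body = b'hello urllib'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet

# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler disables any proxy taken from the environment
opener = build_opener(ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader('Content-Type')
    header_names = [name for name, _ in response.getheaders()]
    body = response.read().decode('utf8')

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(header_names)
print(body)
```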

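Next, the Request class. It handles request construction that urlopen() alone cannot easily do, such as adding headers.

API of the Request class:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: required

data: must be bytes

headers: a dict of request headers. The User-Agent header is commonly set so that the request looks like it comes from a browser

origin_req_host: the host name or IP address of the requester

unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it (for example, an image fetched automatically while loading an HTML page). Defaults to False

method: a string naming the HTTP method to use, such as GET or POST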

Example:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in type
form_data = {
    'name': 'germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))
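Building a Request object does not contact the server, so its contents can be inspected before anything is sent. The sketch below shows this; the extra Accept header is only an illustrative choice.

```python
from urllib import parse, request

data = bytes(parse.urlencode({'name': 'germey'}), encoding='utf8')
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

# Constructing a Request does not touch the network
req = request.Request('http://httpbin.org/post', data=data,
                      headers=headers, method='POST')
# Headers can also be added after construction
req.add_header('Accept', 'application/json')

print(req.get_method())     # POST
print(req.full_url)         # http://httpbin.org/post
print(req.data)             # b'name=germey'
print(req.header_items())   # header names are stored capitalized, e.g. 'User-agent'

# Without an explicit method, data decides: POST with data, GET without
implicit_get = request.Request('http://httpbin.org/get').get_method()
implicit_post = request.Request('http://httpbin.org/post', data=data).get_method()
print(implicit_get, implicit_post)   # GET POST
```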

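For more advanced operations, such as handling cookies or setting a proxy, Handler classes are needed.

In urllib.request, the BaseHandler class provides the most basic methods, such as default_open() and protocol_request(). All other Handler subclasses inherit from it.

HTTPDefaultErrorHandler: handles HTTP error responses by raising an HTTPError exception

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles cookies

ProxyHandler: sets proxies. If no proxies are passed in, the proxy settings are read from the environment variables

HTTPPasswordMgr: manages passwords by maintaining a table of user names and passwords

HTTPBasicAuthHandler: manages authentication. Use it when opening a link that requires authentication

These subclasses are used through the OpenerDirector class, usually just called an opener. The urlopen() function described above is in fact an opener that urllib provides for us. To use the subclasses listed above, build an opener from the corresponding handlers.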
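To see how handlers and an opener fit together, the sketch below builds an opener with one extra handler and lists the handlers it ends up with. No request is sent. The exact set of default handlers can differ between Python builds (HTTPSHandler, for instance, is present only when SSL support is available), so treat the printed list as indicative.

```python
import http.cookiejar
from urllib.request import HTTPCookieProcessor, build_opener

cookie_jar = http.cookiejar.CookieJar()

# build_opener() adds the handlers we pass in on top of its default set
opener = build_opener(HTTPCookieProcessor(cookie_jar))

handler_names = sorted(type(h).__name__ for h in opener.handlers)
for name in handler_names:
    print(name)
```

The custom HTTPCookieProcessor appears alongside defaults such as HTTPRedirectHandler and HTTPDefaultErrorHandler, which is why an opener built this way still follows redirects and raises HTTPError.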

Example:

# Authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Build the password manager
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Build the authentication handler
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    response = opener.open(url)
    html = response.read().decode('utf8')
    print(html)
except URLError as e:
    print(e.reason)


# Proxy
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)


# Getting cookies
import http.cookiejar, urllib.request

# url is the address defined in the examples above
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
for item in cookie:
    print(item.name + '=' + item.value)

# Writing cookies to a file
filename = 'cookies.txt'
# Pick one format; the class used to load the file later must match it
# cookie = http.cookiejar.MozillaCookieJar(filename)   # Mozilla browser cookie format
cookie = http.cookiejar.LWPCookieJar(filename)   # LWP format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save(ignore_discard=True, ignore_expires=True)

# Reading cookies from a file
cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
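The cookie round trip can be observed end to end against a local stand-in server. In the sketch below, the CookieHandler class, the cookie name session, and its value are all invented for illustration. The server sets a cookie on every response and echoes back whatever Cookie header it receives.

```python
import http.cookiejar
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

class CookieHandler(BaseHTTPRequestHandler):
    """Local stand-in server: sets a cookie and echoes the Cookie header it gets."""

    def do_GET(self):
        body = (self.headers.get('Cookie') or '').encode('utf8')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet

server = HTTPServer(('127.0.0.1', 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

cookie = http.cookiejar.CookieJar()
# ProxyHandler({}) disables environment proxies for this local request
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(cookie))

with opener.open(url, timeout=5) as response:
    first_echo = response.read().decode('utf8')    # nothing stored yet, so no cookie is sent
with opener.open(url, timeout=5) as response:
    second_echo = response.read().decode('utf8')   # cookie from the first response is sent back

server.shutdown()
server.server_close()

stored = {item.name: item.value for item in cookie}
print(stored)
print(repr(first_echo), repr(second_echo))
```

The second request carries the cookie automatically because the same opener, and therefore the same CookieJar, is reused.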

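II. Exception Handling

Catching exceptions sensibly makes error diagnosis more precise and the program more robust.

The two commonly used exception classes are URLError and HTTPError, where HTTPError is a subclass of URLError. Any exception raised by the request module can be handled by catching URLError, which has a reason attribute giving the cause of the error. HTTPError is specifically for HTTP request errors and has three attributes: code, reason, and headers. code is the HTTP status code, reason is the cause of the error, and headers holds the response headers.

The reason attribute may be a string or an object, for example an exception instance such as socket.timeout.

In practice, catch the subclass error first, then the parent class error.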

Example:

from urllib import request, error
# The import above is roughly the same as:
# import urllib.request
# import urllib.error

url = 'http://www.baidu.com'
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    # Subclass first: HTTP errors carry a status code and headers
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # Parent class: any other failure raised by the request module
    print(e.reason)
else:
    print('Request successful')
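The ordering of the except clauses can be exercised without a network connection. The sketch below wraps the pattern in a helper (fetch_status is a name chosen here, not part of urllib) and triggers a URLError by using a made-up URL scheme that no handler supports.

```python
from urllib import error, request

def fetch_status(url):
    """Return a (kind, detail) tuple describing how the request ended."""
    try:
        with request.urlopen(url, timeout=5) as response:
            return ('ok', response.status)
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return ('http_error', e.code)
    except error.URLError as e:
        # reason may be a string or an exception object, so normalize it
        return ('url_error', str(e.reason))

# 'foo' is a made-up scheme with no handler, so urlopen() fails before any network access
result = fetch_status('foo://example.invalid/')
print(result)   # ('url_error', 'unknown url type: foo')
print(issubclass(error.HTTPError, error.URLError))   # True
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code would never be reported.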

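III. URL Parsing

The parse module in urllib defines a standard interface for working with URLs. Its commonly used functions are described below.

urlparse(): identifies a URL and splits it into 6 parts: scheme (protocol), netloc (domain), path (access path), params (parameters), query (query string), and fragment (anchor). Together these six parts form a URL of the form

scheme://netloc/path;params?query#fragment

The API of urlparse() is:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Parameters:

urlstring: required, the URL to parse

scheme: optional, the default scheme. It is used when the URL itself carries no scheme

allow_fragments: optional, whether to recognize the fragment. When allow_fragments=False, the original fragment is parsed as part of path, params, or query, and the fragment part is empty

The return value of urlparse() is a named tuple (ParseResult), so each part can be read either by index or by attribute name.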

Example:

from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html#comment'
result = urlparse(url, allow_fragments=False)

# Attribute access and index access return the same value
print(result.scheme, result[0], sep='\n')
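To make the six parts and the effect of each parameter concrete, here is a small sketch. The URLs are illustrative only and nothing is fetched.

```python
from urllib.parse import urlparse

# A URL that exercises all six parts
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#             params='user', query='id=5', fragment='comment')

# With allow_fragments=False the '#comment' part stays attached to the path
no_fragment = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(no_fragment.path, repr(no_fragment.fragment))   # /index.html#comment ''

# The scheme argument is only a default for URLs that carry no scheme themselves
defaulted = urlparse('//www.baidu.com/index.html', scheme='https')
print(defaulted.scheme, defaulted.netloc)   # https www.baidu.com
```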

Next, the other URL-handling functions:

urlunparse(): builds a URL from its parts, the inverse of urlparse(). Note that the iterable passed in must have exactly 6 items.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

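urlsplit(): similar to urlparse(), also used to split a URL. The difference is that it returns only 5 parts: params is no longer parsed separately but stays in path.

urlunsplit(): similar to urlunparse() and the inverse of urlsplit(). Its argument must have 5 items.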
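A short sketch of the split and unsplit pair, using an illustrative URL, shows where params ends up:

```python
from urllib.parse import urlsplit, urlunsplit

original = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlsplit(original)
print(parts)
# SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user',
#             query='id=5', fragment='comment')

# urlunsplit() reverses urlsplit()
rebuilt = urlunsplit(parts)
print(rebuilt == original)   # True

# It also accepts any 5-item iterable
built = urlunsplit(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
print(built)   # http://www.baidu.com/index.html?a=6#comment
```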

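urljoin(): another way to build a URL. The API is:

urllib.parse.urljoin(base, url, allow_fragments=True)

base: the base URL. urljoin() uses its scheme, netloc, and path.

url: the new URL to process. It can take many forms: it may contain all 6 parts of a URL, or only a few consecutive ones.

When parts of the new URL are missing, urljoin() fills them in from base and returns the completed URL. If nothing is missing, the new URL is returned as is.

Example:

from urllib.parse import urljoin
print(urljoin('http://baidu.com/index.html', 'faq.html'))

Note that once url supplies its own path, the params, query, and fragment of base are not carried over.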
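The fill-in behaviour is easiest to see by joining several kinds of new URL against one base. The base and the example.com addresses below are illustrative only.

```python
from urllib.parse import urljoin

base = 'http://baidu.com/docs/index.html?lang=zh#top'

relative = urljoin(base, 'faq.html')
rooted = urljoin(base, '/faq.html')
other_host = urljoin(base, '//example.com/faq.html')
absolute = urljoin(base, 'https://example.com/faq.html')
query_only = urljoin(base, '?page=2')

print(relative)     # http://baidu.com/docs/faq.html
print(rooted)       # http://baidu.com/faq.html
print(other_host)   # http://example.com/faq.html
print(absolute)     # https://example.com/faq.html
print(query_only)   # http://baidu.com/docs/index.html?page=2
```

In every case the ?lang=zh query and #top fragment of the base are absent from the result; only the scheme, host, and path are borrowed.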

urlencode(): serializes GET request parameters, that is, converts a dict into the query string a URL needs.

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
# The '?' separates the path from the query string
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

parse_qs(): deserializes a query string back into a dict

parse_qsl(): converts a query string into a list of tuples

from urllib.parse import parse_qs, parse_qsl

query = 'name=germey&age=22'
print(parse_qs(query))
print(parse_qsl(query))
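Serialization and deserialization are not exact mirrors: values come back as strings, and parse_qs() wraps each one in a list because a key may repeat. A small round-trip sketch (the tag key is an invented example):

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

params = {'name': 'germey', 'age': 22}
query = urlencode(params)
print(query)   # name=germey&age=22

as_dict = parse_qs(query)
as_pairs = parse_qsl(query)
print(as_dict)    # {'name': ['germey'], 'age': ['22']}
print(as_pairs)   # [('name', 'germey'), ('age', '22')]

# Repeated keys: doseq=True expands a list into several key=value pairs
multi = urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)             # tag=a&tag=b
print(parse_qs(multi))   # {'tag': ['a', 'b']}
```

When every key is known to appear once, dict(parse_qsl(query)) gives a flat dict of strings.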

quote(): converts text such as Chinese characters into URL encoding

unquote(): decodes a URL-encoded string

from urllib.parse import quote, unquote

keyword = '壁纸'
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)
print(unquote(url))
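Beyond Chinese text, quote() also escapes spaces and other reserved characters, and its safe parameter controls which characters are left alone. The sketch below compares it with quote_plus(), a related function from the same module that is not covered above.

```python
from urllib.parse import quote, quote_plus, unquote

keyword = '壁纸'
encoded = quote(keyword)
print(encoded)            # %E5%A3%81%E7%BA%B8
print(unquote(encoded))   # 壁纸

# By default '/' is treated as safe and left untouched
default_safe = quote('a b/c')
# safe='' escapes '/' as well
no_safe = quote('a b/c', safe='')
# quote_plus() is meant for query values: space becomes '+', '/' is escaped
plus_form = quote_plus('a b/c')

print(default_safe)   # a%20b/c
print(no_safe)        # a%20b%2Fc
print(plus_form)      # a+b%2Fc
```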

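IV. The Robots Protocol

The last module in urllib is robotparser, which is used to analyze a site's robots protocol.

First, what is the robots protocol?

The robots protocol, also called the crawler protocol, is formally the Robots Exclusion Standard. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the site's root directory. When a search crawler visits a site, it first checks whether this file exists and, if so, crawls within the range the file defines.

Sample robots.txt:

User-agent: *
Disallow: /
Allow: /public/

Here User-agent names the crawler the rules apply to; its value can also be a specific name such as Baiduspider or Googlebot. Disallow lists the directories that must not be fetched, and '/' disallows every page. Allow carves out exceptions and is usually combined with Disallow; /public/ means the public directory may be fetched, acting like a whitelist.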

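With the robots protocol understood, robots.txt can be parsed with the robotparser module. Its API is:

urllib.robotparser.RobotFileParser(url='')

The URL can be passed in directly when creating the object, or set later with set_url(). The commonly used methods are:

set_url(): sets the URL of the robots.txt file

read(): fetches robots.txt and analyzes it. Either this method or parse() must be called before can_fetch() gives meaningful answers, even though read() returns nothing

parse(): parses robots.txt content. The argument is a list of lines from robots.txt

can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns a boolean saying whether that User-agent may fetch the page

mtime(): returns the time robots.txt was last fetched and analyzed, which is useful for re-checking robots.txt periodically

modified(): sets the last fetched and analyzed time to the current time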

Example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
# Alternative to read(): fetch the file yourself and pass its lines to parse()
# from urllib.request import urlopen
# rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
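Because parse() accepts a list of lines, rules can be tested offline without fetching anything. The sketch below uses invented rules and example.com URLs. One caveat, stated as an observation rather than a guarantee: RobotFileParser has traditionally applied the first rule that matches, in file order, so the order of Allow and Disallow lines can change the answer. Putting the more specific Allow line first is the safer arrangement, and the second parser is printed only to let you compare on your own Python version.

```python
from urllib.robotparser import RobotFileParser

def build_parser(lines):
    """Create a parser from robots.txt lines without any network access."""
    rp = RobotFileParser()
    rp.parse(lines)
    return rp

# Hypothetical rules: everything is off limits except /public/
allow_first = build_parser(['User-agent: *', 'Allow: /public/', 'Disallow: /'])
# Same rules in the opposite order, for comparison
disallow_first = build_parser(['User-agent: *', 'Disallow: /', 'Allow: /public/'])

public_url = 'http://example.com/public/index.html'
private_url = 'http://example.com/private/index.html'

public_ok = allow_first.can_fetch('*', public_url)
private_ok = allow_first.can_fetch('*', private_url)
print(public_ok)    # True
print(private_ok)   # False

# Result may depend on how the parser resolves overlapping rules
print(disallow_first.can_fetch('*', public_url))

# parse() records the analysis time, so mtime() is non-zero afterwards
print(allow_first.mtime() > 0)   # True
```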

That wraps up this part of the series; time for a break.
