Using the Basic Libraries

程序员文章站 2022-05-03 21:35:57

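Chapter 3. Using the Basic Libraries
3.1 Using urllib
urllib is Python's built-in HTTP request library, so it can be used without installing anything extra. It contains four modules: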

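request: simulates sending requests
error: defines the exceptions raised by the request module
parse: provides many URL-handling functions, such as splitting, parsing, and joining
robotparser: mainly used to parse a site's robots.txt file and decide which pages may be crawled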
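
To give a feel for the parse and robotparser modules listed above, here is a minimal sketch that runs entirely offline. The URL, the query string, and the robots.txt rules are made up for illustration; only the urllib calls themselves come from the standard library.

```python
import urllib.parse
import urllib.robotparser

# Split an illustrative URL into its components.
url = 'https://httpbin.org/get?word=hello#top'
parts = urllib.parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Turn the query string into a dict, then rebuild the URL from its parts.
query = urllib.parse.parse_qs(parts.query)
rebuilt = urllib.parse.urlunparse(parts)

# Resolve a relative link against a base URL.
joined = urllib.parse.urljoin('https://httpbin.org/get', '/post')

# Feed hypothetical robots.txt rules to the parser instead of downloading them.
robots = urllib.robotparser.RobotFileParser()
robots.parse(['User-agent: *', 'Disallow: /private/'])
allowed = robots.can_fetch('*', 'https://example.com/index.html')
blocked = robots.can_fetch('*', 'https://example.com/private/data.html')
print(query, rebuilt, joined, allowed, blocked)
```

In a real crawler the rules would normally be loaded with set_url() and read() rather than passed in by hand.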

3.1.1 Sending requests
1. urlopen() (a function in urllib.request)

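import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
# read() returns bytes. Without decode('utf-8') the output is not garbled, but it is
# printed as b'...page source...' with \n and \r\n\t shown as literal escapes.
print(response.read().decode('utf-8'))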

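Calling print(type(response)) shows that the result is an HTTPResponse object. It mainly provides methods such as read(), readinto(), getheaders(), and getheader(name), and attributes such as msg, status, and closed. Once the object is assigned to the variable response, these methods and attributes can be used through it.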

import urllib.request
response=urllib.request.urlopen('https://python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48747'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 04 Jun 2018 12:14:17 GMT'), ('Via', '1.1 varnish'), ('Age', '2309'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18722-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '48, 82'), ('X-Timer', 'S1528114457.193407,VS0,VE1'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx
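
The same methods can be tried without depending on an external site. The sketch below is an assumption-laden illustration, not part of the original walkthrough: it starts a throwaway HTTP server on 127.0.0.1 (HelloHandler and fetch_from_local_server are hypothetical names) and reads the response with urlopen().

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = 'hello urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_from_local_server():
    """Start a local server, request it once, and summarize the response."""
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = 'http://127.0.0.1:%d/' % server.server_address[1]
        with urllib.request.urlopen(url, timeout=5) as response:
            return {
                'type': type(response).__name__,
                'status': response.status,
                'content_type': response.getheader('Content-Type'),
                'text': response.read().decode('utf-8'),
            }
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_from_local_server() returns a dict with the response type name, the status code, one header value, and the decoded body. Using the response in a with block closes the connection automatically.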

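What if we want to pass some parameters along with the request? Take a look at the API of urlopen():

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, ...)

The parameters are described in detail below.

The data parameter
This parameter is optional. If it is supplied, it must be byte-stream content, that is, of type bytes, so a string has to be converted first with bytes(). Once this parameter is passed, the request is sent as POST instead of GET.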

import urllib.parse
import urllib.request

# urlencode() turns the parameter dict into a query string; bytes() then encodes it
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('https://httpbin.org/post', data=data)
print(response.read())  # append .decode('utf-8') to print readable text instead of a bytes literal

Output:

b'{"args":{},"data":"","files":{},"form":{"word":"hello"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.6"},"json":null,"origin":"119.39.127.110","url":"https://httpbin.org/post"}\n'

httpbin.org is a service for testing HTTP requests. In its response, the form field holds the data that was sent in the data parameter.
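
The effect of data on the request method can be checked without sending anything. The following is a minimal sketch: it uses the Request class from urllib.request only to inspect the method, and the parameter value 'hello world' is an illustrative assumption.

```python
import urllib.parse
import urllib.request

# urlencode() produces a str; the data parameter needs bytes.
query = urllib.parse.urlencode({'word': 'hello world'})
data = query.encode('utf-8')  # equivalent to bytes(query, encoding='utf8')

# Build two requests to the same URL, one without data and one with it.
get_req = urllib.request.Request('https://httpbin.org/post')
post_req = urllib.request.Request('https://httpbin.org/post', data=data)

print(query)
print(get_req.get_method(), post_req.get_method())
```

Nothing is sent over the network here; get_method() only reports which HTTP method would be used.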

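The timeout parameter
If this parameter is omitted, the global default timeout is used. The value is given in seconds. Setting it makes it possible to skip a page that does not respond for a long time, which can be done with a try/except statement as follows: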

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # socket.timeout is the timeout exception. isinstance() checks whether e.reason
    # is of that type and takes inheritance into account, unlike a type() comparison.
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
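
One caveat worth sketching: depending on where the timeout occurs, urlopen() may wrap it in URLError (as above) or let socket.timeout propagate directly, for example when the connection succeeds but the server never answers. The helper below, with the hypothetical name fetch_or_none, is one way to cover both cases under that assumption.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout wrapped by urllib, typically while connecting or sending.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL or HTTP error is not swallowed
    except socket.timeout:
        # Timeout raised directly, typically while waiting for the response.
        return None
```

A crawler loop could then call fetch_or_none(url, timeout=3) and simply move on to the next URL when the result is None.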

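2. Request (a class in urllib.request)
If the request needs to carry headers or other extra information, it has to be built with the more powerful Request class.
To be written tomorrow...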
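
As a minimal sketch of what building such a request could look like (the User-Agent string is a made-up value, and nothing is sent over the network here):

```python
import urllib.parse
import urllib.request

# Hypothetical header value, used only for illustration.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; demo-crawler)'}
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

req = urllib.request.Request(
    'https://httpbin.org/post',
    data=data,
    headers=headers,
    method='POST',
)

print(req.full_url)
print(req.get_method())
# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(req.get_header('User-agent'))

# To actually send it, pass the object to urlopen():
# response = urllib.request.urlopen(req, timeout=10)
```

The Request object is passed to urlopen() in place of the URL string, so the data and timeout handling described earlier still applies.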
