Python Web Scraping with the urllib Library (Part 1): Fetching Pages, Saving Pages, and Getting Request Information

程序员文章站 2022-05-03 20:03:39


import urllib.request

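I. Introduction

urllib is Python's built-in HTTP request library. It consists of the following modules:

urllib.request: opens and reads URLs
urllib.error: exceptions raised by urllib.request
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing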
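
As a quick illustration of two of these modules that needs no network access, the sketch below splits a URL with urllib.parse and checks two paths against robots.txt rules with urllib.robotparser. The rules and the /private/ path are invented for the example; they are not Baidu's real robots.txt.

```python
import urllib.parse
import urllib.robotparser

# Split a URL into its components.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Hypothetical robots.txt content, passed in as lines instead of being downloaded.
rules = ["User-agent: *", "Disallow: /private/"]
parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Ask whether any user agent ("*") may fetch each URL under those rules.
allowed = parser.can_fetch("*", "http://www.baidu.com/index.html")
blocked = parser.can_fetch("*", "http://www.baidu.com/private/page.html")
print(allowed, blocked)
```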

II. Fetching a Specified URL

with urllib.request.urlopen("http://www.baidu.com") as file:
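    data = file.read()  # read the entire response body as bytes
    # The body is a stream: read() has already consumed it, so the two calls
    # below return b"" and [] here. Use only one of the three per response.
    line = file.readline()  # read a single line
    lines = file.readlines()  # read all remaining lines into a list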
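
Because a response body can only be consumed once, it can help to wrap each reading style in its own function that opens a fresh response. The following is a minimal sketch under that assumption; read_text, read_lines, and SAMPLE_URL are names made up for this example, and a data: URL stands in for a real page so the code runs without a network connection.

```python
import urllib.request

def read_text(url, encoding="utf-8"):
    """Fetch url and return the whole body decoded as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode(encoding)

def read_lines(url):
    """Fetch url and return the body as a list of byte lines."""
    with urllib.request.urlopen(url) as resp:
        return resp.readlines()

# Assumed stand-in for a real page: a data: URL holding two short lines.
SAMPLE_URL = "data:text/plain;charset=utf-8,first%0Asecond%0A"

text = read_text(SAMPLE_URL)
lines = read_lines(SAMPLE_URL)
print(text)
print(lines)
```

Replacing SAMPLE_URL with "http://www.baidu.com" should behave the same way, provided the page is actually encoded as UTF-8.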

III. Saving the Page Locally

1. Write the data that was read to a file

with open("./1.html","wb") as f:
    f.write(data)

2. Download straight to a local file with urlretrieve

# urlretrieve returns a (local path, headers) tuple
filename, headers = urllib.request.urlretrieve("http://www.baidu.com", "./2.html")
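
The Python documentation lists urlretrieve as a legacy interface, so one possible alternative is to stream the response into a file with shutil.copyfileobj. This is only a sketch: download and SAMPLE_URL are hypothetical names, the target file name 3.html is arbitrary, and a data: URL replaces the real page so the example runs offline.

```python
import os
import shutil
import tempfile
import urllib.request

def download(url, path):
    """Stream the response body of url into the file at path."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)  # copies in chunks, not all at once
    return path

# Assumed stand-in for a real page so the sketch runs offline.
SAMPLE_URL = "data:text/html;charset=utf-8,%3Chtml%3Ehello%3C%2Fhtml%3E"

with tempfile.TemporaryDirectory() as tmp:
    saved = download(SAMPLE_URL, os.path.join(tmp, "3.html"))
    with open(saved, "rb") as f:
        content = f.read()

print(content)
```

Copying in chunks avoids holding a large page or file in memory, which the read-then-write approach in step 1 does.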

file.info()

<http.client.HTTPMessage at 0x1170c95be0>

IV. Getting Request Information

1. Get the status code

file.getcode()

2. Get the URL

file.geturl()

'http://www.baidu.com'

3. Get the headers

file.getheaders()

[('Date', 'Mon, 09 Apr 2018 17:11:24 GMT'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('Transfer-Encoding', 'chunked'),
 ('Connection', 'Close'),
 ('Vary', 'Accept-Encoding'),
 ('Set-Cookie',
  'BAIDUID=4B4DEF37A228ED2722DF818D3F4A6C29:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'BIDUPSID=4B4DEF37A228ED2722DF818D3F4A6C29; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'PSTM=1523293884; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie', 'BDSVRTM=0; path=/'),
 ('Set-Cookie', 'BD_HOME=0; path=/'),
 ('Set-Cookie', 'H_PS_PSSID=1430_21090_22160; path=/; domain=.baidu.com'),
 ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'),
 ('Cache-Control', 'private'),
 ('Cxy_all', 'baidu+230416a5fbb4a587682dea3e4efe4e59'),
 ('Expires', 'Mon, 09 Apr 2018 17:11:05 GMT'),
 ('X-Powered-By', 'HPHP'),
 ('Server', 'BWS/1.1'),
 ('X-UA-Compatible', 'IE=Edge,chrome=1'),
 ('BDPAGETYPE', '1'),
 ('BDQID', '0xab6114e500016321'),
 ('BDUSERID', '0')]
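
As the output shows, getheaders() returns a list of (name, value) tuples, and a header such as Set-Cookie can appear several times, so converting the list directly to a dict would keep only the last value. One way to keep every value is to group them by name. The helper group_headers below is a hypothetical name, and the sample list is a shortened copy of the output above.

```python
from collections import defaultdict

def group_headers(header_pairs):
    """Group (name, value) pairs by lower-cased header name."""
    grouped = defaultdict(list)
    for name, value in header_pairs:
        grouped[name.lower()].append(value)
    return dict(grouped)

# Shortened copy of the header list shown above.
sample = [
    ("Content-Type", "text/html; charset=utf-8"),
    ("Set-Cookie", "BDSVRTM=0; path=/"),
    ("Set-Cookie", "BD_HOME=0; path=/"),
    ("Server", "BWS/1.1"),
]

headers = group_headers(sample)
print(headers["set-cookie"])
```

With a live response, calling group_headers(file.getheaders()) would apply the same grouping to the real headers.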

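V. Handling Special Characters in URLs

Use quote to percent-encode a string and unquote to decode it. Both functions are defined in urllib.parse; urllib.request.quote works only because urllib.request imports them from there.

import urllib.parse

s = urllib.parse.quote("http://www.baidu.com")
s

'http%3A//www.baidu.com'

urllib.parse.unquote(s)

'http://www.baidu.com'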
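
Note that quoting a whole URL also escapes the colon after the scheme, which is rarely what is wanted. In practice it is more common to encode only the parts that carry special characters, such as query values. The sketch below shows the safe parameter of quote and uses urlencode for a query string; the /s path and the wd parameter are assumptions chosen for illustration.

```python
import urllib.parse

# Keep ":" and "/" unescaped so a full URL survives quoting unchanged.
full = urllib.parse.quote("http://www.baidu.com", safe=":/")

# Percent-encode a keyword containing a space and non-ASCII characters.
keyword = "python 爬虫"
encoded = urllib.parse.quote(keyword)

# urlencode builds a query string from a dict (spaces become "+").
query = urllib.parse.urlencode({"wd": keyword})
search_url = "http://www.baidu.com/s?" + query  # assumed path and parameter

print(full)
print(encoded)
print(search_url)
```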
