Python Web Scraping Notes: the urllib Library

程序员文章站 2022-05-03 20:03:51


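Note: in Python 3.x, the urllib and urllib2 libraries from Python 2 were merged into a single urllib package.

urllib2.urlopen() became urllib.request.urlopen()
urllib2.Request() became urllib.request.Request()

Importing the library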

from urllib import request

Fetching a web page and reading its content

response = request.urlopen('http://www.baidu.com')
print(response.read())

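First we call the urlopen function from urllib.request and pass it a URL. urlopen generally takes three parameters:

request.urlopen(url, data, timeout)

The first parameter, url, is the URL to open and is required. The second, data, is the data to send with the request; it defaults to None, and when it is supplied an HTTP request is sent as a POST. The third, timeout, sets the timeout in seconds.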
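
As a minimal sketch of how these parameters are usually passed, the helper below gives data and timeout as keyword arguments and catches the exceptions that urlopen can raise. The name fetch and the 10-second default are assumptions made for this example, not part of urllib.

```python
import socket
from urllib import error, request


def fetch(url, data=None, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # timeout is in seconds; data, if given, must be bytes
        with request.urlopen(url, data=data, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500
        print('HTTP error:', exc.code, exc.reason)
    except (error.URLError, socket.timeout) as exc:
        # No usable response: DNS failure, refused connection, timeout, ...
        print('Request failed:', exc)
    return None
```

Calling fetch('http://www.baidu.com') would return the page as bytes, and a failed request returns None instead of ending the script with a traceback. HTTPError is caught first because it is a subclass of URLError.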

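read()

The read() method returns the content of the fetched page as a bytes object, which can be turned into a str with decode().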
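
Because read() returns bytes, the page usually has to be decoded before it can be handled as text. The sketch below is one way to do that; it assumes the server declares its charset in the Content-Type header and falls back to UTF-8 otherwise. The function name fetch_text and the fallback are choices made for this example.

```python
from urllib import request


def fetch_text(url, timeout=10, default_charset='utf-8'):
    """Fetch url and return the body decoded to str."""
    with request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Charset declared in the Content-Type header, if there is one
        charset = response.headers.get_content_charset() or default_charset
    # errors='replace' avoids a crash on bytes that do not fit the charset
    return raw.decode(charset, errors='replace')
```

For example, print(fetch_text('http://www.baidu.com')[:200]) would show the first 200 characters of the page as readable text rather than a bytes literal.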

Request

re = request.Request('http://www.baidu.com')
response = request.urlopen(re)
print(response.read())

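The result is exactly the same as above, except that a Request object now sits in between. This style is recommended: building a request often means adding more information, such as data and headers, so you first construct a Request and then let the server respond to it, which keeps the logic clear.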
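
To illustrate why a separate Request object is convenient, the sketch below builds one and inspects it without sending anything over the network. The Accept-Language header is only an illustrative value added for this example.

```python
from urllib import request

req = request.Request('http://www.baidu.com')

# Nothing has been sent yet; the object only describes the request
print(req.full_url)      # the URL that was passed in
print(req.host)          # host part parsed from the URL
print(req.get_method())  # 'GET', because no data is attached

# The same object can still be adjusted before it is opened
req.add_header('Accept-Language', 'zh-CN')
print(req.header_items())
```

Only request.urlopen(req) actually contacts the server, so everything about the request can be prepared and checked beforehand.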

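Requests with data parameters

from urllib import parse, request   # import the modules

values = {'username':"wu",'password':'******'}
# urlencode() returns a str, but Request needs bytes for data, so encode it
data = parse.urlencode(values).encode('utf-8')
print(data)
# Supplying data makes this a POST request
re = request.Request('http://www.baidu.com',data=data)
response = request.urlopen(re)
print(response.read())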
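
The same encoded parameters can be sent either in the URL (GET) or in the request body (POST). The sketch below shows both forms side by side without sending them; the parameter names and values are hypothetical.

```python
from urllib import parse, request

values = {'username': 'wu', 'page': 2}  # hypothetical parameters
query = parse.urlencode(values)          # a str of key=value pairs joined by '&'

# GET: the parameters are appended to the URL after '?'
get_req = request.Request('http://www.baidu.com/?' + query)

# POST: the parameters travel in the body and must be bytes
post_req = request.Request('http://www.baidu.com', data=query.encode('utf-8'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)

# parse_qs reverses urlencode; every value comes back as a list of str
print(parse.parse_qs(query))
```

Which form a site expects depends on the site, so it is worth checking the request in the browser's Network panel before choosing one.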

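Setting Headers

Some sites reject requests made in the way shown above. To get through, we imitate a browser by setting the request headers.
Open the browser, right-click, choose Inspect, open the Network tab, reload the page, and click the first entry in the Name column. The User-Agent value near the end of the request headers is the header we need. Now let's set the headers: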

header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
re = request.Request('http://www.baidu.com',headers=header)
response = request.urlopen(re)
print(response.read())
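
Headers can also be added one at a time and read back before the request is sent. One detail that is easy to miss is that Request stores header names in capitalized form, so User-Agent is kept as User-agent. The sketch below shows this; the shortened User-Agent string and the Referer header are illustrative assumptions.

```python
from urllib import request

UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3)'  # shortened for the example

req = request.Request('http://www.baidu.com', headers={'User-Agent': UA})
req.add_header('Referer', 'http://www.baidu.com')  # hypothetical extra header

# Names are stored via str.capitalize(), so look them up in that form
print(req.get_header('User-agent'))   # the UA string
print(req.get_header('User-Agent'))   # None: this spelling is not the stored key
print(req.has_header('Referer'))
print(req.header_items())
```

Printing header_items() before calling urlopen is a quick way to confirm what will actually be sent.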

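Setting a proxy and timeout

Many sites track how many requests an IP address makes within a period of time and block the IP if there are too many. A proxy server helps here: route requests through a proxy and switch to a different one every so often.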

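# The value should be the address of a proxy server, normally 'http://host:port';
# the URL below is kept from the original and must be replaced with a working proxy
proxy = request.ProxyHandler({'http':'http://www.xicidaili.com/wn/'})
opener = request.build_opener(proxy)
request.install_opener(opener)
# header is the User-Agent dict defined in the previous section
re = request.Request('http://www.baidu.com',headers=header)
# timeout is an argument of urlopen(), not of Request()
response = request.urlopen(re, timeout=10)
print(response.read())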
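
Switching proxies every so often can be sketched by cycling through a list and building a fresh opener for each one. The addresses below are placeholders from a documentation-only IP range and the helper name next_opener is hypothetical; real proxy addresses have to be substituted before this can reach any site.

```python
from itertools import cycle
from urllib import request

# Placeholder addresses (documentation-only IP range); substitute real proxies
PROXIES = cycle(['http://192.0.2.10:8080', 'http://192.0.2.11:8080'])


def next_opener():
    """Build an opener that routes HTTP traffic through the next proxy."""
    proxy_url = next(PROXIES)
    handler = request.ProxyHandler({'http': proxy_url})
    return request.build_opener(handler), proxy_url


opener, proxy_url = next_opener()
print('using proxy', proxy_url)
# opener.open() accepts the same timeout argument as urlopen(), for example:
# response = opener.open('http://www.baidu.com', timeout=10)
```

Using opener.open() directly, instead of install_opener(), keeps the proxy choice local to each opener, so a later call to next_opener() can switch proxies without changing global state.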
