Web Scraping Basics 5: Using the urllib Library

程序员文章站 2022-05-03 20:06:09

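	urllib.request.urlopen()  sends a request to the server the way a browser would
	response    the data returned by the server
		for HTTP(S) URLs, response is an http.client.HTTPResponse object
		bytes --> string
				decode()
		string --> bytes
				encode()
		read()       reads the body as bytes; read(5) returns only the first 5 bytes
		readline()   reads one line
		readlines()  reads line by line until the end and returns a list of lines
		getcode()    returns the status code
		geturl()     returns the URL
		getheaders() returns the response headers
	urllib.request.urlretrieve()  downloads a resource to a local file
		web pages
		images
		videos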
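
The methods in the outline above can be tried without touching a real site. What follows is a minimal sketch, assuming a throwaway local server: `DemoHandler`, `PAGE`, and the file name `demo_copy.html` are hypothetical names introduced for illustration and do not come from the original article.

```python
import http.server
import os
import tempfile
import threading
import urllib.request

# Hypothetical page body served by the local demo server.
PAGE = '<html>\n<body>你好, urllib</body>\n</html>\n'.encode('utf-8')


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

# Ignore any system proxy settings so the loopback request goes direct.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

response = urllib.request.urlopen(url)
status = response.getcode()             # HTTP status code
final_url = response.geturl()           # URL that was actually fetched
headers = dict(response.getheaders())   # getheaders() returns (name, value) pairs
first_bytes = response.read(5)          # first 5 bytes only
rest_of_line = response.readline()      # remainder of the first line
remaining = response.readlines()        # every remaining line, as a list
response.close()

# bytes --> string
text = (first_bytes + rest_of_line + b''.join(remaining)).decode('utf-8')

# urlretrieve() writes the response body straight to a local file.
target = os.path.join(tempfile.mkdtemp(), 'demo_copy.html')
urllib.request.urlretrieve(url, target)
with open(target, 'rb') as f:
    saved = f.read()

server.shutdown()
server.server_close()
print(status, final_url, type(response).__name__)
print(text)
```

Against a real site only the `url` changes. Note that the Python documentation marks `getcode()` and `geturl()` as deprecated since 3.9 in favor of the `status` and `url` attributes, although both methods still work.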

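Customizing the request object

About the UA: the User Agent (UA for short) is a special string sent in the request headers that lets the server identify the client's operating system and version, CPU type, browser and version, browser engine, rendering engine, browser language, browser plugins, and so on. By default urllib identifies itself as Python-urllib, which some sites reject, so a scraper usually supplies a browser-like UA through a customized request object.

Syntax: request = urllib.request.Request(url=url, headers=headers)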
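
To make the syntax concrete, here is a small sketch that builds a request object and inspects it without sending anything. The target `http://www.example.com/` is a placeholder chosen for illustration; the UA string is the one used in the examples later in this article.

```python
import urllib.request

# Placeholder URL, used only to build the object; nothing is sent.
url = 'http://www.example.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

# Pass url and headers by keyword: the second positional
# parameter of Request is data, not headers.
request = urllib.request.Request(url=url, headers=headers)

print(request.full_url)
print(request.get_method())  # GET, because no data was supplied

# Request stores header names via str.capitalize(),
# so the lookup key is 'User-agent'.
print(request.get_header('User-agent'))

# Without a custom header, the default opener announces Python-urllib/<version>.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']
print(default_ua)
```

Passing the resulting `request` to `urllib.request.urlopen()` sends it with the customized headers.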

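Aside: where character encodings come from

'''The evolution of character sets ---
Because computers were invented in the United States, only 128 characters (codes 0 to 127) were encoded at first: upper- and lower-case English letters, digits, and some symbols.
This table is called ASCII; for example, uppercase A is encoded as 65 and lowercase z as 122.
One byte is clearly not enough for Chinese, which needs at least two bytes per character and must not clash with ASCII,
so China defined the GB2312 encoding to cover Chinese characters.
As you can imagine, the world has hundreds of languages: Japan encoded Japanese in Shift_JIS, Korea encoded Korean in Euc-kr,
and with every country keeping its own standard, conflicts were unavoidable. The result is garbled text whenever several languages are mixed.
Unicode was created to solve this. It unifies all languages in a single character set, which removes these conflicts.
The Unicode standard keeps evolving. In the common UTF-16 form most characters take two bytes, and very rare characters take four.
Modern operating systems and most programming languages support Unicode directly.'''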
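
The byte counts described above can be checked directly in Python. This is only an illustrative sketch; the sample characters `ch` and `rare` are chosen here and are not part of the original text.

```python
# ASCII code points quoted in the text.
print(ord('A'), ord('z'))  # 65 122

ch = '中'
gb_bytes = ch.encode('gb2312')   # 2 bytes in GB2312
utf8_bytes = ch.encode('utf-8')  # 3 bytes in UTF-8
print(gb_bytes, utf8_bytes)

# Decoding with the wrong codec is what produces garbled text.
garbled = gb_bytes.decode('utf-8', errors='replace')
print(garbled != ch)  # True

# Decoding with the matching codec restores the original character.
print(gb_bytes.decode('gb2312') == ch)  # True

# Two bytes in UTF-16 for a common character, four for a rare one.
rare = '\U00020000'
print(len(ch.encode('utf-16-le')), len(rare.encode('utf-16-le')))  # 2 4
```

This is why the examples in this article call `decode('utf-8')` on the response body: the codec used for decoding has to match the one the server used for encoding.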

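Encoding and decoding

GET requests: urllib.parse.quote()

Example:
import urllib.request
import urllib.parse

url = 'https://www.baidu.com/s?wd='

# Browser-like User-Agent so the request is not sent as Python-urllib
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

# Percent-encode the non-ASCII keyword before appending it to the URL
url = url + urllib.parse.quote('小野')

# Build the customized request object, then send it
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

# The body arrives as bytes; decode it into a string
print(response.read().decode('utf-8'))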
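
To see what quote() actually does to the keyword, the following sketch runs the encoding step on its own, with no network access. The `'a/b c'` string is an extra sample added here to show the `safe` parameter.

```python
import urllib.parse

keyword = '小野'

# Each UTF-8 byte of a non-ASCII character becomes a %XX escape.
encoded = urllib.parse.quote(keyword)
print(encoded)  # %E5%B0%8F%E9%87%8E

url = 'https://www.baidu.com/s?wd=' + encoded
print(url)

# unquote() reverses the transformation.
print(urllib.parse.unquote(encoded))  # 小野

# '/' is left alone by default; pass safe='' to escape it as well.
print(urllib.parse.quote('a/b c'))           # a/b%20c
print(urllib.parse.quote('a/b c', safe=''))  # a%2Fb%20c
```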

GET requests: urllib.parse.urlencode()

Example:
import urllib.request
import urllib.parse
url = 'http://www.baidu.com/s?'
data = {
    'name': '小刚',
    'sex': '男',
}
# urlencode() percent-encodes every value and joins the pairs with '&'
data = urllib.parse.urlencode(data)
url = url + data
print(url)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
# The body arrives as bytes; decode it into a string
print(response.read().decode('utf-8'))
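
When there are several parameters, urlencode() saves calling quote() on each one. The sketch below shows the string it produces for the dictionary used above, without sending a request; the `q` and `tag` parameters are extra samples added for illustration.

```python
import urllib.parse

params = {'name': '小刚', 'sex': '男'}
query = urllib.parse.urlencode(params)
print(query)  # name=%E5%B0%8F%E5%88%9A&sex=%E7%94%B7

# parse_qs() goes the other way; each value comes back as a list.
decoded = urllib.parse.parse_qs(query)
print(decoded)

# urlencode() uses quote_plus() by default, so a space becomes '+'.
print(urllib.parse.urlencode({'q': 'a b'}))  # q=a+b

# doseq=True expands a list value into repeated keys.
print(urllib.parse.urlencode({'tag': ['x', 'y']}, doseq=True))  # tag=x&tag=y
```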

POST requests

Example: Baidu Translate
import urllib.request
import urllib.parse
url = 'https://fanyi.baidu.com/sug'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
keyword = input('Enter the word to look up: ')
data = {
    'kw': keyword
}
# POST data must be bytes: urlencode() returns a str, so encode() it
data = urllib.parse.urlencode(data).encode('utf-8')
# Supplying data turns the request into a POST
request = urllib.request.Request(url=url, headers=headers, data=data)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
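
The extra .encode('utf-8') step matters because urllib refuses a str body. The sketch below illustrates this with request objects only; the keyword `'spider'` is a sample value, and no request reaches the server because the type check fails before anything is sent.

```python
import urllib.parse
import urllib.request

url = 'https://fanyi.baidu.com/sug'
form = urllib.parse.urlencode({'kw': 'spider'})  # str: 'kw=spider'
body = form.encode('utf-8')                      # bytes: b'kw=spider'

# The presence of data is what switches the method from GET to POST.
get_request = urllib.request.Request(url=url)
post_request = urllib.request.Request(url=url, data=body)
print(get_request.get_method())   # GET
print(post_request.get_method())  # POST

# Passing the un-encoded str is rejected before the request is sent.
error_message = None
bad_request = urllib.request.Request(url=url, data=form)
try:
    urllib.request.urlopen(bad_request)
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```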

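Summary: how do GET and POST differ?

1: GET parameters must be URL-encoded. They are appended to the URL, and the encoded string needs no further encode() call.
2: POST parameters must also be URL-encoded. They are passed through the data argument when the request object is built, and the encoded string must then be converted to bytes with encode().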
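
The two points above can be observed side by side with a local echo server. This is a sketch under stated assumptions: `EchoHandler` and the loopback address are hypothetical stand-ins for a real site, and the handler simply reports where the parameters arrived.

```python
import http.server
import json
import threading
import urllib.parse
import urllib.request


class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that reports the method, path, and body it saw."""

    def _reply(self, body):
        payload = json.dumps(
            {'method': self.command, 'path': self.path, 'body': body}
        ).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply('')

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        self._reply(self.rfile.read(length).decode('utf-8'))

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = http.server.HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:{}/sug'.format(server.server_port)

# Ignore any system proxy settings so the loopback requests go direct.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

params = urllib.parse.urlencode({'kw': '小野'})

# GET: the encoded string is appended to the URL as-is.
get_request = urllib.request.Request(url=base + '?' + params)
with urllib.request.urlopen(get_request) as resp:
    get_seen = json.loads(resp.read().decode('utf-8'))

# POST: the same string is converted to bytes and passed as data.
post_request = urllib.request.Request(url=base, data=params.encode('utf-8'))
with urllib.request.urlopen(post_request) as resp:
    post_seen = json.loads(resp.read().decode('utf-8'))

server.shutdown()
server.server_close()
print(get_seen)
print(post_seen)
```

In the GET case the parameters show up in `path`; in the POST case `path` stays clean and the same encoded string arrives in `body`.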
