Web Scraping with urllib
1. Getting to Know urllib
The urllib package contains the following modules:
- urllib.request — opens and reads URLs
- urllib.error — exceptions raised by urllib.request
- urllib.parse — parses URLs into components
- urllib.robotparser — parses robots.txt files
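The examples below exercise only urllib.request, so here is a minimal sketch of urllib.parse and urllib.error (it assumes httpbin.org is reachable):
from urllib import request, parse, error
# urllib.parse: split a URL into its components
parts = parse.urlparse('http://httpbin.org/get?name=abc')
print(parts.netloc, parts.path, parts.query)  # httpbin.org /get name=abc
# urllib.error: catch the HTTPError that urllib.request raises on 4xx/5xx
try:
    request.urlopen('http://httpbin.org/status/404')
except error.HTTPError as e:
    print(e.code)  # 404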
2. Scraping with urllib
2.1 A simple GET request
Quick and dirty, and easy to get blocked:
from urllib import request, parse
# fetch the page directly
url = "http://httpbin.org/"
string = request.urlopen(url).read().decode('utf8')
print(string)
Add headers to make the request look like an ordinary browser:
url = "http://httpbin.org/"
headers = {
'Host': 'httpbin.org',
'Referer': 'http://httpbin.org/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
req = request.Request(url, headers=headers, method='GET')
string = request.urlopen(req).read().decode('utf8')
print(string)
Both approaches return the page's HTML source.
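Besides read(), the object returned by urlopen is an http.client.HTTPResponse, so you can inspect the status and headers before decoding the body; a small sketch:
from urllib import request
resp = request.urlopen('http://httpbin.org/')
print(resp.status)                     # HTTP status code, e.g. 200
print(resp.getheader('Content-Type'))  # look up a single response header
html = resp.read().decode('utf8')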
2.2 A slightly more advanced POST
Attach data to send form fields:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# renamed from dict to avoid shadowing the built-in
form = {
    'name': 'abc',
    'password': '123'
}
# urlencode the form fields, then encode to the bytes urlopen requires
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
string = request.urlopen(req).read().decode('utf8')
print(string)
The server echoes back a JSON response:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "abc",
    "password": "123"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
  },
  "json": null,
  "origin": "202.119.46.99",
  "url": "http://httpbin.org/post"
}
With the json module you can read fields out of it:
import json
j = json.loads(string)
print(j['form']['name'])
# Output: abc
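The same parse.urlencode helper also builds GET query strings; a minimal sketch against httpbin's /get echo endpoint (not part of the original examples):
from urllib import request, parse
import json
params = parse.urlencode({'name': 'abc', 'password': '123'})
url = 'http://httpbin.org/get?' + params
string = request.urlopen(url).read().decode('utf8')
print(json.loads(string)['args']['name'])  # prints: abc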
2.3 Using cookies
Step 1: obtain cookies
import http.cookiejar
from urllib import request
cookie = http.cookiejar.CookieJar()
# route all requests through a cookie-aware opener
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
The returned cookies look like this:
BAIDUID=112C1EAFD************1B0E3DC9:FG=1
BIDUPSID=112C************F31C931B0E3DC9
H_PS_PSSID=
PSTM=15*****188
delPer=0
BDSVRTM=0
BD_HOME=0
Step 2: save cookies to a local file
(1) MozillaCookieJar
import http.cookiejar
from urllib import request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# also keep session cookies and already-expired cookies in the file
cookie.save(ignore_discard=True, ignore_expires=True)
(2) LWPCookieJar
import http.cookiejar
from urllib import request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
The two classes write noticeably different cookie file formats.
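Roughly, the two files begin with different magic headers, which is why load() must use the matching class (cookie values elided):
# cookie.txt, written by MozillaCookieJar (Netscape format), starts with:
# Netscape HTTP Cookie File
...
# cookie1.txt, written by LWPCookieJar, starts with:
#LWP-Cookies-2.0
...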
Step 3: use the saved cookies
Load the file with the same CookieJar class that saved it:
import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
This again returns the page's HTML source.
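Tying the pieces together, a hedged sketch that sends the saved cookies along with a custom User-Agent (via opener.addheaders) and a timeout; the URL and cookie file follow the examples above:
import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
# headers assigned here are attached to every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36')]
response = opener.open('http://www.baidu.com', timeout=10)
print(response.status)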