Python Web Crawler Tutorial (4): The requests Library in Detail
In the previous chapter we covered the basics of urllib (see Python Web Crawler Tutorial (3): The urllib Library in Detail). urllib has its inconveniences, though, so for crawling tasks we can use requests, a more convenient and concise HTTP library.
If you don't have requests installed, run the following command in a terminal (on Windows, Linux, or macOS):
pip install requests
If you don't have pip installed, see: A Beginner's Guide to Installing Python and pip.
Basic Usage
1. GET Requests
1. Basic Example
GET is the most common HTTP request; in requests it is issued like this:
import requests
response = requests.get('http://httpbin.org/get')
print(type(response))
print(response.text)
The result:
<class 'requests.models.Response'>
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5eccc552-e9bfd8204c6d591075a2b890"
},
"origin": "171.107.139.104",
"url": "http://httpbin.org/get"
}
As you can see, the get() method returns a requests.models.Response object, and this style of request closely mirrors urllib's urlopen() method.
To add query parameters, besides building them into the URL yourself, you can use get()'s params argument:
import requests
data = {
'name': 'germey',
'age': 22
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
The output:
{
"args": {
"age": "22",
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5eccc6f8-ee3d14ec71ec38bec92961aa"
},
"origin": "171.107.139.104",
"url": "http://httpbin.org/get?name=germey&age=22"
}
We build a dict data and pass it through the params argument, which serializes it into the URL's query string; this is simpler and more readable than building the URL by hand.
Also, the returned body is of type str, but it is special: it is JSON-formatted. To parse it into a dict, call the json() method:
import requests
response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(type(response.json()))
The output:
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.23.0', 'X-Amzn-Trace-Id': 'Root=1-5eccc86b-7372a90f1143321aa6393206'}, 'origin': '171.107.139.104', 'url': 'http://httpbin.org/get'}
<class 'dict'>
As you can see, calling json() converts the result into a dict. If the response body is not valid JSON, however, a parse error is raised.
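Since json() raises an error on non-JSON bodies, it is worth guarding the call. Here is a minimal offline sketch of the same pattern using the standard json module (which is what requests uses internally; invalid input raises a subclass of ValueError). The helper name parse_json_safely is made up for illustration:

```python
import json

# json.JSONDecodeError is a subclass of ValueError, so catching
# ValueError covers invalid input in a version-independent way.
def parse_json_safely(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        return None

print(parse_json_safely('{"name": "germey"}'))   # {'name': 'germey'}
print(parse_json_safely('<html>not json</html>'))  # None
```

The same try/except around response.json() keeps a crawler from crashing on an unexpected HTML error page.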
2. Fetching Binary Data
To get a page's content we use the Response object's text attribute, whose value is an HTML document. To fetch images, audio, or video, use the Response's content attribute instead, which holds the file's raw binary data.
Let's try downloading the CSDN logo shown in this page's toolbar:
import requests
response = requests.get('https://csdnimg.cn/cdn/content-toolbar/csdn-logo.png?v=20200416.1')
print(type(response.text))
print(type(response.content))
The output:
<class 'str'>
<class 'bytes'>
The content attribute holds bytes; let's save it to a file:
import requests
response = requests.get('https://csdnimg.cn/cdn/content-toolbar/csdn-logo.png?v=20200416.1')
with open('test.ico', 'wb') as f:
    f.write(response.content)
The file saved in the working directory is exactly the image we wanted.
Audio and video files can be fetched the same way.
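For large media files it is better not to hold the whole body in memory at once. A sketch using requests' stream=True together with iter_content() (the URL and the filename sample.png are placeholders chosen for illustration):

```python
import requests

# stream=True defers downloading the response body; iter_content()
# then yields it in fixed-size chunks, so a large file never has to
# fit in memory all at once.
url = 'http://httpbin.org/image/png'
with requests.get(url, stream=True) as response:
    with open('sample.png', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```

This pattern is the usual choice for downloading video or audio, where response.content would buffer the entire file first.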
3. Adding headers
As with urllib.request, we can pass header information through the headers parameter. Take Zhihu's Explore page as an example. Without a request header:
import requests
response = requests.get('https://www.zhihu.com/explore')
print(response.text)
The result:
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
We find the request is rejected: without a request header, the server detects that the request was not sent by a browser and refuses it. We can disguise the source by setting the User-Agent field to Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36, which identifies the request as coming from Chrome. With that disguise in place, the code becomes:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)
The output confirms that the page is now fetched successfully.
2. POST Requests
With GET covered, POST requests are just as simple:
import requests
data = {'name': 'germey', 'age': 22}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)
The result:
{
"args": {},
"data": "",
"files": {},
"form": {
"age": "22",
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "18",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5eccd5c6-b1b0d97280f0059458476556"
},
"json": null,
"origin": "171.107.139.104",
"url": "http://httpbin.org/post"
}
3. The Response
In the examples above we read the response body via the text and content attributes. Many other attributes and methods expose additional information, such as the status code, response headers, and Cookies:
import requests
response = requests.get('https://www.baidu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
Here we print the status_code attribute for the status code, headers for the response headers, cookies for the Cookies, url for the URL, and history for the request history, together with each value's type.
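The status code is the attribute most often acted on. As a sketch, it can be compared against requests' built-in status lookup, and raise_for_status() turns 4xx/5xx responses into exceptions:

```python
import requests

r = requests.get('http://httpbin.org/get')
# requests.codes.ok is 200; comparing against it avoids magic numbers.
if r.status_code == requests.codes.ok:
    print('Request Successful')
# raise_for_status() does nothing on success, but raises an HTTPError
# for 4xx/5xx responses.
r.raise_for_status()
```

In a crawler, raise_for_status() right after each request is a compact way to stop on failed pages.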
Advanced Usage
1. File Upload
Some sites require uploading files; requests handles this too:
import requests
files = {
'file': open('test.ico', 'rb')
}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
The result:
{
"args": {},
"data": "",
"files": {
"file": "data:application/octet-stream;base64,iVBORw0KGgoAAAANSUhEUgAAAKAAAABYCAYAAAByDvxZAAATRUlEQVR4Xu1ceXAc1Zn/fW9kyyfGXMa2ZJsjYIORp6dblj09ksWGsEDYJASccGyWLMuxlRAIEFKpQLJhi2yKLJCwCUcqS3EEsoEUZLkWyBIQ0rRkS93TIwMGgrlkYS7b2EaxsKTpb+uNJe+MNOpjJDyTmn5V+mu+633vp9ff+77vPUI4Qg+U0ANUQt2h6tADCAEYgqCkHggBWFL3h8pDAIYYKKkHQgCW1P2h8hCAIQZK6oEQgCV1f6g8BGCIgZJ6IARgSd0fKg8BGGKgpB4IAVhS94fKQwCGGCipB0IAltT9ofIQgCEGSuqBEIAldX+oPARgiIGSeiAEYEndHyoPARhioKQeCAFYUveHykMAhhgoqQdCAJbU/aHyEIAhBkrqgRCAJXV/qDwEYIiBknogBGBJ3R8q/9QA+HbihLkCUxpY0HIClhJ4CRMdDsZhBEwHMA2AYCBDwB4Auxm8k4APmWgLA68Jh//MQKqmLfUiAc7ElmttJBbrPcIR4hg4mCsEZjmE2WCeuVcuDQLI/gk4uxm0iwR/hKHIViLaYllt72PCNuTMYO3aCH7/+8zE5jTp3KQojYcAvIDIOZgF5gJiJoGnM4OIwAzezcCuCIv3AOqxrLZ3J2LFpAHwQSCyulFJEIm/Y8YpAC8DkZiIcSO8BHy8cObOQ+nJTRKovsaKFasXUoROE0RrmBAF42gQqn0xFyTiAQCvM9MLBFov4DxiWe2vjyaNafqlDFrJ4HkEuYA8FyAJ8pkAT2NGFQ37hZkdIhoAoR+M7QBvY8YWIvEG2NlETHZ1tdPd0dHRX7zdYzlVtXF+hpzlxLwMgpZmfQMcAWARgKkBde0CkAKjlZgeTaWSVhD+CQPwjbiyeGoVXQTQ+QzUBFHum5axvbbNOtiLfsmS5mkHHTp0DjMuBHi13Na8eIr+nfkm22r/zmj+qKa/QsCxRcsdyzgEUCeIn3KAh7u7jJeKla1o8XsBOgXAocXK8ORjdBPRbSlz/p2A9w5f9AJtbo4dTQ5dy8C5AKZ4GjYxgnRtq6W4iBCKql8Mwg8ALJiYKn/cTPTNdFfytgIA3EXAbH9SiqGiTiL+j1SX8V9BQ4Kopm8j4KBitAbmIXRHWFxsmm2dbryBAfjeyXUzB/urfgjC5QBN4JPmf0oEfrSmNfXFQhwnaIkjI+DfEtDgX+KkUJ5mm8aTuZKOO655VvWMwY8nRbqnEHqZgW+kzWSLJykAVVWnODRNhjCB19yP/HFoPnEI53d3GQ+OJyOQMZsbtZUg5z6APjMBo4pgpVtrW81LRzOqaqLBIecJgDw/z0UodWXJAEs3mMaruUTRaNNnqCrz58nWNa48GUOCbk5ZxncBsJteGROLKaJ3v9n2/4oyDJyaNo3/LaTbNwDfadIuyIDvoE//c1vIzu/Vtlo35P5Qp+nHRhgdIBno798hDw99O+fM2LTpybxDkVIfbwLT8/vXmuye9lu7y/ia2ydZVVfFHIoEOiBM4jy29ovBpa90dm4bLdMXAHsb1R+AcB1PxvbN2E2EPgfYTcAQgBkAphNwAAORQpNmonMXPW/KmGdkCEXTZWyhFuGkITDeAtADQh8zMiCeStmTKh8wvJvKIF3aVXgw3rEtY8yBS9HiZwOUa2dB/uHTb79MLfHeU+ckhDJ0s20mrxrP5Fh94hRmzgsZivBd8SzEN9ldYw9tngDsbVKvYuDGIjVvIUBuvc9FmF4eGurfVNvx0vaCIFuLSG/visMj1WK+49CRBCxjIMYg3cngS4sNKznC53ehc/Uw8EeAb+uvpmdfNQzPOE1V1TlDYloNObyEAGnPUiYcD5ACRrdtGU2j5xHT9CsYuNnNV8R8S8pqvzJ3t1q7dm2kp6dn5uDg1CUsMscRKOYAZ0q9AfzOJMSpqc62pwvxRFX9fC
Lc7SmPcA8YbQx63QG/G8mI7ZnM9N1DQwM0a1akemhoYKFDvJzIOQWgs+Tm4SlzL8Guj7ZOmffWWy2f5NK7ArC3Uf08Ex4LGLjKWOSPQjg3Lmyxn/FpnMtmA4HmZkEtLXK3zI6olnicwJ/3KVvac5FtGnf6pPciE3UNzQs2rG8ZE09FNf2nBFztJoCBK9Km8XMvJXvn2agLODexzwMWA29G+JNjLcuSCfW8EdPi32VQXhhT0IYMFtq2scWPfYqiL0AE9wA4yQ89gC/apvGoLwD26MoCiogNAPwH+DKJCucfatrsP/k0qCgyRdM/CJDLesQ2jS8VpSggU1TTf0PA37uz8Tm22f67AKJliukaEP7VDw8RXZDqSt41mlapj98IpnE/0cP0vGf3R9M2btwok+6+xtFHH109+8B58uukeTIwfmJbxvd9AfCdJvU+BzjPU+g+Au6KCDptQYu11T9PcZSKpsvdsGC8OEYi43bbMr5RnKZgXIqWeAbgz7pykWi2u9oCH1QUVf8JCN/zYZFhm0ZiNJ2/fw7ssE0j8KEuqumfo2yI4zkes03jC54A7NEVjSJCBvmeMWJWGKNnKCIajmjpes/ThEkgUDT9Hf8JZ94JiLNsMznhcMDLdEVNbATxMje6QukbL7ny9+bm5qqdfYMyxSNLZuMPZkcgUjO6RqtougTI51xZgVfTprHUjz25NMO27dhbbnTV0Gmb7Xn52oIA29yk3gf/ux8TRfSa5zs7ghpeLL2iJe4FWKYdgowuMD2ECJ6yO5MytHDNmwURPEKraPpHAA50450aGZqzfv16WT8NPGKafgkDd3gxMosz0lbbf+fSKareDUKdB2+LbRoneskv9HtM019hzxIkvWybyeNy+ccAcPtJ6py/7MEHIH9FaQY9tqjVzNtWi5lAEJ46Lb5cMJlUfHPBh8x4ThD9aRB45gUz+UYQ/YVoV69ePf2TQbHbQ85u2zQ8donxJdTVxQ+LTCX5lXH9MjH4R2mz/bo8AGq67Fo53HV/YtyftgyPGLawBJ818Jds01juCsDepthXGPSA3wVhxhmL2qy8/za/vBOh85Py8C2f+TWQeMxhfnzu7CltLTknbr8yZEmwCjymOyaXn4E30qZxlF+ZhegUH0AC4Xa7Ky/ulXlTebDwiJvpBttM+okzx5imaInNAHs1o3TZprHSC4B3MOgSP06SieS+zLSDlvrIq/mRF5QmWq9/hxg/9doRAsrdAsY9AnxnoXar8WTJlAnB2ZerHIeu4AEhiH2KqqdBWOHOQ7+zzeQ5IzTLGxrmTclUecfnzJfZVvsvgtgjaYdjQJnf8zoYPmWbxqnuAGxU25gw5hQ1jlGbalut/VwXzrdEVeMnO0QyLnIPzoN6VTbKEn7jRDLXpdetk5UT1xFVE2uJeNyi+zDzQ7ZpyORt0UPRdNOrAsTAfWkzW5rLjhXaakVApLyV8lm22f6QN10+RTS6aglVRd704mPG3WnL+Ed3ADap7zEwz0uY/J0YLTVtVlFBqx/5fmnq6upmiimzLyPg2yAc5pfPJ10/mK+yrfbb3egVNX4ZiG5xo2Hg1rRpjGmq8GlHlkzR9NeAbAOp2/iVbRr/PEIQW5k4nR2WBQX3IThud7YHPkxGtUQzgZ/zEk/Av6VM4xpXAG5uUmUg7be88nRtqyUbHMtiyINA/4A4jwjyv79xMj/NxPTrlJW8eLyJ+snTEejalJn8cbHOGj7o7PTqv2TwdWmz/Uf7AOj39DyUOSKd9t7tR9sfjennkYDMnLgOAr6VMo1fegFQdnj4assmYF1NqyU7j8tuRFetWkKZqi+AWcYcawL8U407FwJdlzKT+xY2l1Cp1+8G43wPR1w4kZJgLKY3skCrt7Odc22zY19ThKLq14HwQw8+/njH+9M3bfJ/7WFEXlTTryZkY3F3ABLOTHUZD3sBUNYB53sJG/59R02rdfDELwz51FYkmdw5BgbEqgxxkwCtYWBVUYBk7BHg4wsdTqKa/j
QBJ7uZyMynp632J4qcBmL1+l3M+LoXv2BnmWV1vDJCp2j6fwL4Jw++bbZpHOIlu9DvUU2/mYArPHkzHLft/E/8mHxST6OaJILuKWyEgJ3m2jY7cGnJt/xPgbBm9erphw1F1jD4bDCfA5CvHT9rCvH1dle7bP3PG1FNf4GAvBzXaBrBQrWsNh+HgbGTrlvZdETEyUhQudrK4N602V6bK0HRdNmG5RUqvWCbhleiuuBqKJoua9tf9VyqDC+x7fa3XXfAnqbYzwj0bU9h+wCIltoyOIj4tncUYSzWWMfCafcuI+1jLFgtUDRd1sDdGzcCdJrkmjlc8G8DUO9jnr+0TeNbowAoKz8nuPIyPW1bSS+QFgagqreCsjG32yjY6DBmB+xdo57IjGd9TDSHhC6sbTUnq90pmOpJoPbz+cxRY9qmkQeEYYDIq5PjVihkE+qBs6dWB01yS9kHzJl3LxO+4meqgp06y+p4IZfWz2UkBt2VNpMX+NExmsbnyXyrbRpjbuONcRgDordJ3RQkr0ZAxiG+aNHzqTFtQMVMaD/zyHan10FY4kcvA8m0aeT9tytKfDEi5J4rZHxgW4av9NaIHcP3OGRVyldIRKAnUmby9Nx5LGlunja3b1BmNlzLd4VSJH78IWkUTf+Lawf5XkEbbNMYk0AvaNQ7zeqFjoNf+zVghI4IjzNnrqxtTctc1aQP+bnMkHNAt2XIT+YEX0rYa15Ui/8LgQqebAtNoFAuT1HiqxEhadP4Y28XddSPU449Vp89YxZdDmJZFvNbOx4gJxNNpda9nKtDVeNHOURyQ3EdBLo0ZSZv9aIb/buqnjTHoX7ZCeM1xlRBJENBALJsj39fNeCzEzdP896bWu0UoQcGM9y+ODJ7I7Xkt2EXslQ+5UFVVUcJxjJmoQmmZxe2mY/k0sa0+DUMup6B7SA8LxgdQw5bU2jqBstqCdSHqKzUNcrg+0w4w8tzub87jL/ptoy8pGusXv8yM9wrCMxP21b7uDFWQ0PDvIFMVQMDawlZm/wCL2seE65Odxljrk4oSrwJEe+LUlQgReLHL7FY0zIWmY1etON94sfdlnsTK45hEVkHUOAGxVHGZEB4C8zbAepjUJ+QmxdjOhPmEOhQRrZ6kX8JiJ2zatvsvEWNavFfEWi8ZPAOAt6UbekE9DBjl7z8xMx9LOTlJzFDMB/GgGwGkLlL+QxF0NFhm0Z8NFNMS3yTwXkJ1gKCZYwoXzXYwow+AbAjaA4xzyVgEQN5J9cghhHhwVSXcXahFjO/92fIwepUylgXRK+kVZTGzyLiePZaEujHKTN57Wj5rnHBuyfWr8lknP9htxtiQS32Se8wJRa3mXIX3jcUNf4UiP7Wp4jJJWP0OQKrCj2NEdMS1zM4r8Q0ucpdpT0m+JMzC90DyQJETVwFYs9LZYKx2LKMnqB2K1riawDf68XHoEvTBT7xnh3PbzdqugA/CtpPTzoMz8Rx6KjFSTOvT0/R4hsBcu049nJEMb8zy9e7+MzxksiKFr8ToKJOkMXYM8wj36v6hd01/0q3N1gULXETwPIWntsIfBdkRJjfy06OQ1/uTiX/EGgHHCHOPkA0RdzH7LtLZgJ+HWYdnDqjdtSrUIqm9wWNjSZqCIPfJ4p81e0eh89E70RNyeXfQg5dkkolH/cS6jNJ/KFtGkU1cUTV+M+J6HIvOzJMqzZYyfVFAVAyZdMzjeoFTLiefHbLeBnl8vuO2lYrL/ZU1eZDHBr8cAIyg7IyA/c7A3zVhg3t8hbeuMNfj15Q9YXoeSeYbtnTP+XfN25skf+MniOq6W0Ej40jwAl9tMKoqj9APnKUgxRZ9GJX6+aiATjCuEVVZwzNwtfJwWWgSX2GbJ9tBLxc02rl3R1YUa8fTw5vGHlbz9PzxRLIzy3RwxC4we5MdvsRo6j6+59CG9iIavmP0EnM9++eRnf7uVSfa7Oi6TKM8eqVLJgi8TN3fwCXF6X2TCsUp3rGgG5GvJ2IqVWCznLAJ2
VfDPDuiPUzJ5kberCm1RpTW6yvb6odgnMGmE9hIDGJz6DtYlAbiB6JOJE/BEzpkFKfeBKOcyRAiyb2CGbWPfLCkjwtbyDiZKaKn+vu6JC3AIsaMU3/Gdi91u2AO9OWIS+YBx5Kvf5Dcsg1wc5Av20lx7ylKJVNCIC51m5raDhgT/VAlEksdxjHMGOBAObLVIt8jpeG3z/JfZIXwE6GTM/gPRB6BItXWAykalo2+HnFSayo15eRQytI8FL5jAUzauX7d8yYBcIsGS8yQ4AwQIx+JnxE4G0AbQaylYuNEOi2O+e/6OcxRR+rQ9Fo8xxU9x8qBqsOdoR8AljMFMKZmQFPJYciBPlCqnzqdu+TwMz0McPZ6cDZGslUb06nW/wkdX2Y8tdBMmkA/OuYbmhluXkgBGC5rUiF2RMCsMIWvNymGwKw3FakwuwJAVhhC15u0w0BWG4rUmH2hACssAUvt+mGACy3Fakwe0IAVtiCl9t0QwCW24pUmD0hACtswcttuiEAy21FKsyeEIAVtuDlNt0QgOW2IhVmTwjAClvwcptuCMByW5EKsycEYIUteLlNNwRgua1IhdkTArDCFrzcphsCsNxWpMLsCQFYYQtebtMNAVhuK1Jh9oQArLAFL7fphgAstxWpMHtCAFbYgpfbdEMAltuKVJg9/wdg3rqzVZRXGQAAAABJRU5ErkJggg=="
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "5134",
"Content-Type": "multipart/form-data; boundary=cb3879890fe3ecc2242d77aed4456739",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5eccd9cb-fd61d63c5b7393a0a4820390"
},
"json": null,
"origin": "171.107.139.104",
"url": "http://httpbin.org/post"
}
In the previous section we saved the CSDN logo; here we reuse it to simulate a file upload. The output shows that the uploaded file appears in the files field, encoded as base64.
2. Cookies
Handling Cookies with urllib was quite verbose; in requests, getting and setting cookies takes just one step:
import requests
response = requests.get('https://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
The output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
We first request Baidu and read its cookies; the result is a RequestsCookieJar. The items() method then lets us iterate over each cookie's name and value. Cookies can also be used to maintain a logged-in state; take Zhihu as an example.
First, log in to Zhihu and copy the value of the Cookie field from the request Headers in your browser's developer tools. Add it to the headers dict and send the request:
import requests
headers = {
'Cookie':'_zap=3f3a8847-850e-43e0-be78-b4be0bfa6995; d_c0="AIDX3Ryz2RCPTmMuf_Mm8rKi-lXSeAtT9EU=|1582267483"; _ga=GA1.2.903862747.1583308568; _xsrf=gJaEH2r1y8UDTP1Ej3IozSme9zAcufUL; tst=r; capsion_ticket="2|1:0|10:1588241682|14:capsion_ticket|44:YTY0NmRlOGM2NGE0NDY4OWJhNjRkNDI3ZDM2MjUyZjc=|2845e561cc3deefa38a0f3af569a97c251a4295347df5a19499c63cd06fc96c1"; r_cap_id="MjljZWU0MDJiZWQ4NDEyN2I1YWU2YWY0MzQyMGVmZTk=|1588241690|9a83e301b5a66e0c3a88b97148b2ae53b2f17438"; cap_id="ODEwZWI3NGM0ZDFjNGM2Yjg1MzhmY2YzN2Y5MzJiNDU=|1588241690|abd8b0789de34c3c9deaaff5a955957633d3f9f1"; l_cap_id="NDhlODVkZGI4MGJiNDMzZGJmMjk0NTcwOTczMmVhNmU=|1588241690|31dedb916b6140599e8cd7efc4c69fde323c7f4b"; z_c0=Mi4xbElYOUR3QUFBQUFBZ05mZEhMUFpFQmNBQUFCaEFsVk5KUE9YWHdDOEt2cFJDUW9qMlFnclZMLWNhRUdzX0YwNE5B|1588241700|b205b28d7e4b0a1bf4e8b554ab73574164eb0c73; q_c1=67dc75e659a445e3aab6b904ca4772ed|1588242255000|1588242255000; __utmv=51854390.110--|2=registration_date=20190531=1^3=entry_date=20190531=1; _gid=GA1.2.28121021.1590480973; __utmc=51854390; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1589720541,1589788910,1590480972,1590484318; BAIDU_SSP_lcr=https://www.baidu.com/link?url=C8qsxEG_x3OMJtP83xu1k1mrHxJPpf20TPUUFxPi7BS&wd=&eqid=ea3bb79e00024eb9000000065eccdd58; SESSIONID=ErO66ceM7PuCJd223PseKNutezkIlS8HiDTqimrnydQ; JOID=Vl8TA0i1_WTj2O72K7GZPsICyaY4844t1umOkmCFrhCmkY3HGNMnrbnc7PEuNDz0unQt3Q6KyGjuLwaRnKm_58E=; osd=V10VB0K0_2Ln0u_0LbWTP8AEzaw58Ygp3OiMlGSPrxKglYfGGtUjp7je6vUkNT7yvn4s3wiOwmnsKQKbnau548s=; __utma=51854390.903862747.1583308568.1590480996.1590484903.3; __utmb=51854390.0.10.1590484903; __utmz=51854390.1590484903.3.3.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/jude-78-40; _gat_gtag_UA_149949619_1=1; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1590485091; KLBRSID=0a401b23e8a71b70de2f4b37f5b4e379|1590485093|1590480970',
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
response = requests.get('https://www.zhihu.com/people/jude-78-40', headers=headers)
print(response.text)
The output contains the logged-in page content, which shows that the login state was carried over successfully.
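Instead of pasting the raw Cookie string into headers, requests can also send cookies through its cookies parameter, which accepts a plain dict or a RequestsCookieJar. A small sketch against httpbin, which echoes back the cookies it received:

```python
import requests

# Build a jar, attach a cookie scoped to httpbin.org, and send it.
jar = requests.cookies.RequestsCookieJar()
jar.set('name', 'germey', domain='httpbin.org', path='/')
response = requests.get('http://httpbin.org/cookies', cookies=jar)
print(response.json())  # {'cookies': {'name': 'germey'}}
```

For a site like Zhihu you would split the copied Cookie header on `;` and load each name=value pair into the jar the same way.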
3. Session Maintenance
Calling get() or post() directly does simulate page requests, but each call is effectively a separate session, as if you had opened the pages in two different browsers. If you log in to a site with post() and then request your profile page with get(), you will not get your profile, because the two requests belong to different sessions. We could attach the same Cookie to every request by hand, but that is tedious; requests provides a Session object to solve this. On httpbin we can set a cookie named number with the value germer, then request the cookies endpoint to read it back:
import requests
requests.get('http://httpbin.org/cookies/set/number/germer')
response = requests.get('http://httpbin.org/cookies')
print(response.text)
The output:
{
"cookies": {}
}
No cookie came back. Now let's try a Session:
import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/germer')
response = s.get('http://httpbin.org/cookies')
print(response.text)
The output:
{
"cookies": {
"number": "germer"
}
}
The cookie is returned this time: a Session maintains state across requests, and is typically used for the steps that follow a simulated login.
4. SSL Certificate Verification
requests verifies SSL certificates by default: when it sends an HTTPS request, it checks the server's certificate, and if the certificate is expired or otherwise invalid, an SSL error is raised. The verify parameter controls this check; it defaults to True. To skip verification:
import requests
re = requests.get('https://www.xxxx.cn', verify=False)
print(re.text)
Note that with verify=False requests still emits an InsecureRequestWarning. We can also supply a local certificate to use as the client-side certificate, either as a single file or as a tuple of two file paths (certificate and key):
import requests
re = requests.get('https://www.xxxx.cn', cert=('/path/server.crt', '/path/key'))
print(re.text)
5. Proxy Settings
Some sites respond normally to a handful of test requests, but once large-scale crawling begins they may present a captcha, redirect to a login page, or even ban the client IP outright for a while.
To guard against this, we can route traffic through a proxy using the proxies parameter:
import requests
proxy = {
    'http': 'http://120.25.253.234:812',
    'https': 'http://163.125.222.244:8123'
}
re = requests.get('http://httpbin.org/get', proxies=proxy)
print(re.text)
The proxy addresses above are placeholders; replace them with working ones of your own.
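If a proxy requires authentication, the credentials can be embedded in the proxy URL with the user:password@host:port syntax. All hosts, ports, and credentials below are placeholders for illustration:

```python
# HTTP Basic auth for the proxy goes directly in the URL.
proxies = {
    'http': 'http://user:password@10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:1080',
}

# After `pip install requests[socks]`, SOCKS proxies use the same shape:
socks_proxies = {
    'http': 'socks5://user:password@10.10.1.10:1080',
    'https': 'socks5://user:password@10.10.1.10:1080',
}
```

Either dict is then passed as proxies= to get() or post(), exactly as above.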
6. Timeout Settings
When the local network is poor, or the server is slow or unresponsive, we may wait a very long time for a response, or never receive one at all. To guard against this, set a timeout: if no response arrives within that time, an error is raised. This uses the timeout parameter:
import requests
r = requests.get('https://www.baidu.com', timeout=0.1)
print(r.status_code)
If no response arrives within 0.1 seconds, an exception is raised.
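The exception raised is requests.exceptions.Timeout (a subclass of RequestException), so it can be caught and handled. A sketch against httpbin's /delay endpoint, which waits 3 seconds before responding:

```python
import requests

try:
    # The server delays its reply by 3 seconds, so a 0.5-second
    # timeout is guaranteed to trigger.
    r = requests.get('http://httpbin.org/delay/3', timeout=0.5)
    timed_out = False
except requests.exceptions.Timeout:
    timed_out = True
print('timed out:', timed_out)
```

In a crawler this is typically paired with a retry or a skip, rather than letting the exception kill the whole run.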
In fact, a request has two phases, connect and read. The single timeout value above is applied to both; to specify them separately, pass a tuple:
r = requests.get('https://www.baidu.com', timeout=(0.1, 0.5))
To wait indefinitely, set timeout to None, or simply omit the parameter, since None is its default.
7. Authentication
Some sites present an HTTP authentication prompt when visited. requests has built-in support for basic authentication:
import requests
r = requests.get('https://localhost:5000', auth=('username', 'password'))
print(r.status_code)
If the username and password are correct, authentication succeeds and a 200 status code is returned.
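The (username, password) tuple is shorthand for requests' HTTPBasicAuth class. httpbin's /basic-auth/&lt;user&gt;/&lt;passwd&gt; endpoint is handy for verifying the behavior, since it returns 200 only when the supplied credentials match the ones in the URL:

```python
import requests
from requests.auth import HTTPBasicAuth

# Equivalent to auth=('user', 'passwd'); spelling out the class makes
# the mechanism explicit and lets you swap in e.g. HTTPDigestAuth.
r = requests.get('http://httpbin.org/basic-auth/user/passwd',
                 auth=HTTPBasicAuth('user', 'passwd'))
print(r.status_code)  # 200
```

With wrong credentials the same endpoint returns 401 instead.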
8. Prepared Requests
When we covered urllib, we represented a request as a data structure, with its parameters carried by a Request object. requests supports the same approach:
from requests import Request, Session
url = 'http://httpbin.org/post'
data = {
'name': 'germey'
}
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
s = Session()
req = Request('POST', url, headers=headers, data=data)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
The result:
{
"args": {},
"data": "",
"files": {},
"form": {
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "11",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5ecd0951-17a8dc447274d384cb1cd7e0"
},
"json": null,
"origin": "171.107.139.104",
"url": "http://httpbin.org/post"
}
As you can see, we achieved the same POST request. With a Request object, each request can be treated as an independent object, which is very convenient for queue scheduling.