Python Web Crawler Tutorial (4): The requests Library in Detail


In the previous chapter we covered the basics of urllib (see Python Web Crawler Tutorial (3): The urllib Library in Detail). urllib can be awkward to use, though, so in this chapter we switch to requests, a more convenient and concise HTTP library, to handle our crawling tasks.

If you don't have requests installed, run the following command in a terminal; it works the same on Windows, Linux, and macOS:

pip install requests

If you don't have pip installed, see: A Python and pip Installation Tutorial for Beginners.

Basic Usage

1. GET Requests

1. Basic example
GET is the most common HTTP request type; in requests it is sent like this:

import requests

response = requests.get('http://httpbin.org/get')
print(type(response))
print(response.text)

The result is as follows:

<class 'requests.models.Response'>
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5eccc552-e9bfd8204c6d591075a2b890"
  },
  "origin": "171.107.139.104",
  "url": "http://httpbin.org/get"
}

As you can see, get() returns a requests.models.Response object, and this way of making a request closely mirrors urllib's urlopen().

To add query parameters, instead of building them into the URL by hand, you can use the params argument of get():

import requests

data = {
    'name': 'germey',
    'age': 22
}

response = requests.get('http://httpbin.org/get', params=data)
print(response.text)

The output is as follows:

{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5eccc6f8-ee3d14ec71ec38bec92961aa"
  },
  "origin": "171.107.139.104",
  "url": "http://httpbin.org/get?name=germey&age=22"
}

Here we build a dict called data and let params encode it into the URL's query string for us, which is both simpler and more readable.
Also, the page's return value is a str, but a special one: it is JSON-formatted. To parse it into a dict, call the json() method:

import requests

response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(type(response.json()))

The output is as follows:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.23.0', 'X-Amzn-Trace-Id': 'Root=1-5eccc86b-7372a90f1143321aa6393206'}, 'origin': '171.107.139.104', 'url': 'http://httpbin.org/get'}
<class 'dict'>

As you can see, calling json() converts the JSON string into a dict. If the response body is not valid JSON, however, json() raises a parsing error.
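
If you want to guard against that, you can wrap the call in a try/except; a minimal sketch (recent requests versions raise requests.exceptions.JSONDecodeError, older ones the underlying json.JSONDecodeError, and both subclass ValueError):

import requests

# httpbin's /html endpoint returns an HTML page, not JSON
response = requests.get('http://httpbin.org/html')

try:
    data = response.json()
except ValueError:
    # Raised when the body is not valid JSON
    print('Response body is not valid JSON:', response.text[:60])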

2. Fetching binary data
To get a page's content we use the Response object's text attribute, which for a web page holds the HTML document. To grab images, audio, or video instead, use the Response object's content attribute, which holds the file's raw bytes.

Let's try downloading the CSDN logo shown in this page's toolbar:

import requests

response = requests.get('https://csdnimg.cn/cdn/content-toolbar/csdn-logo.png?v=20200416.1')
print(type(response.text))
print(type(response.content))

The output is as follows:

<class 'str'>
<class 'bytes'>

The content attribute is of type bytes; let's try saving it to a file:

import requests

response = requests.get('https://csdnimg.cn/cdn/content-toolbar/csdn-logo.png?v=20200416.1')
with open('test.ico', 'wb') as f:
    f.write(response.content)

The file saved in the working directory is exactly the image we wanted. Audio and video files can be fetched the same way.
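
For large audio or video files it is better not to hold the whole body in memory; a sketch using requests' stream=True together with iter_content (the URL here is only a placeholder):

import requests

# Placeholder URL, shown only to illustrate streaming downloads
url = 'https://example.com/sample-video.mp4'

with requests.get(url, stream=True) as response:
    with open('sample-video.mp4', 'wb') as f:
        # Write the body in 8 KB chunks instead of buffering it all at once
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)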

3. Adding headers
As with urllib.request, we can pass header fields through the headers argument. Take Zhihu's Explore page as an example; first without any request headers:

import requests

response = requests.get('https://www.zhihu.com/explore')
print(response.text)

The result is as follows:

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

The request was rejected: without headers, the target server can tell that the request was not sent by a browser and refuses it. We can disguise the source by setting the User-Agent field to Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36, which makes the request appear to come from Chrome. Rewriting the example accordingly:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)

This time the output contains the page's HTML: the request succeeds.

2. POST Requests

With GET requests covered, sending a POST is just as simple:

import requests

data = {'name': 'germey', 'age': 22}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)

The result is as follows:

  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5eccd5c6-b1b0d97280f0059458476556"
  },
  "json": null,
  "origin": "171.107.139.104",
  "url": "http://httpbin.org/post"
}
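
Our data appears under the form field, which means it was sent as a form submission (Content-Type: application/x-www-form-urlencoded). To send the body as JSON instead, requests accepts a json parameter and sets the Content-Type header for you:

import requests

# json= serializes the dict and sends it as an application/json body
response = requests.post('http://httpbin.org/post', json={'name': 'germey', 'age': 22})
print(response.json()['json'])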

3. Responses

In the examples above we read the response body through the text and content attributes. Many other attributes and methods expose the rest of the response, such as the status code, response headers, and Cookies:

import requests

response = requests.get('https://www.baidu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

The output prints the status code from status_code, the response headers from headers, the Cookies from cookies, the URL from url, and the request history from history, each together with its type.
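
status_code is the usual way to check whether a request succeeded. A small sketch comparing it against requests' built-in code lookup and using raise_for_status(), which raises an HTTPError for 4xx/5xx responses:

import requests

# httpbin's /status endpoint echoes back the requested status code
response = requests.get('http://httpbin.org/status/404')
print(response.status_code == requests.codes.ok)  # False: the server returned 404

try:
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
except requests.HTTPError as e:
    print('Request failed:', e)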

Advanced Usage

1. File upload

Some sites require uploading files, which requests also supports:

import requests

files = {
    'file': open('test.ico', 'rb')
}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)

The result is as follows:

{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,iVBORw0KGgoAAAANSUhEUgAAAKAAAABYCAYAAAByDvxZAAATRUlEQVR4Xu1ceXAc1Zn/fW9kyyfGXMa2ZJsjYIORp6dblj09ksWGsEDYJASccGyWLMuxlRAIEFKpQLJhi2yKLJCwCUcqS3EEsoEUZLkWyBIQ0rRkS93TIwMGgrlkYS7b2EaxsKTpb+uNJe+MNOpjJDyTmn5V+mu+633vp9ff+77vPUI4Qg+U0ANUQt2h6tADCAEYgqCkHggBWFL3h8pDAIYYKKkHQgCW1P2h8hCAIQZK6oEQgCV1f6g8BGCIgZJ6IARgSd0fKg8BGGKgpB4IAVhS94fKQwCGGCipB0IAltT9ofIQgCEGSuqBEIAldX+oPARgiIGSeiAEYEndHyoPARhioKQeCAFYUveHykMAhhgoqQdCAJbU/aHyEIAhBkrqgRCAJXV/qDwEYIiBknogBGBJ3R8q/9QA+HbihLkCUxpY0HIClhJ4CRMdDsZhBEwHMA2AYCBDwB4Auxm8k4APmWgLA68Jh//MQKqmLfUiAc7ElmttJBbrPcIR4hg4mCsEZjmE2WCeuVcuDQLI/gk4uxm0iwR/hKHIViLaYllt72PCNuTMYO3aCH7/+8zE5jTp3KQojYcAvIDIOZgF5gJiJoGnM4OIwAzezcCuCIv3AOqxrLZ3J2LFpAHwQSCyulFJEIm/Y8YpAC8DkZiIcSO8BHy8cObOQ+nJTRKovsaKFasXUoROE0RrmBAF42gQqn0xFyTiAQCvM9MLBFov4DxiWe2vjyaNafqlDFrJ4HkEuYA8FyAJ8pkAT2NGFQ37hZkdIhoAoR+M7QBvY8YWIvEG2NlETHZ1tdPd0dHRX7zdYzlVtXF+hpzlxLwMgpZmfQMcAWARgKkBde0CkAKjlZgeTaWSVhD+CQPwjbiyeGoVXQTQ+QzUBFHum5axvbbNOtiLfsmS5mkHHTp0DjMuBHi13Na8eIr+nfkm22r/zmj+qKa/QsCxRcsdyzgEUCeIn3KAh7u7jJeKla1o8XsBOgXAocXK8ORjdBPRbSlz/p2A9w5f9AJtbo4dTQ5dy8C5AKZ4GjYxgnRtq6W4iBCKql8Mwg8ALJiYKn/cTPTNdFfytgIA3EXAbH9SiqGiTiL+j1SX8V9BQ4Kopm8j4KBitAbmIXRHWFxsmm2dbryBAfjeyXUzB/urfgjC5QBN4JPmf0oEfrSmNfXFQhwnaIkjI+DfEtDgX+KkUJ5mm8aTuZKOO655VvWMwY8nRbqnEHqZgW+kzWSLJykAVVWnODRNhjCB19yP/HFoPnEI53d3GQ+OJyOQMZsbtZUg5z6APjMBo4pgpVtrW81LRzOqaqLBIecJgDw/z0UodWXJAEs3mMaruUTRaNNnqCrz58nWNa48GUOCbk5ZxncBsJteGROLKaJ3v9n2/4oyDJyaNo3/LaTbNwDfadIuyIDvoE//c1vIzu/Vtlo35P5Qp+nHRhgdIBno798hDw99O+fM2LTpybxDkVIfbwLT8/vXmuye9lu7y/ia2ydZVVfFHIoEOiBM4jy29ovBpa90dm4bLdMXAHsb1R+AcB1PxvbN2E2EPgfYTcAQgBkAphNwAAORQpNmonMXPW/KmGdkCEXTZWyhFuGkITDeAtADQh8zMiCeStmTKh8wvJvKIF3aVXgw3rEtY8yBS9HiZwOUa2dB/uHTb79MLfHeU+ckhDJ0s20mrxrP5Fh94hRmzgsZivBd8SzEN9ldYw9tngDsbVKvYuDGIjVvIUBuvc9FmF4eGurfVNvx0vaCIFuLSG/visMj1WK+49CRBCxjIMYg3cngS4sNKznC53ehc/Uw8EeAb+uvpmdfNQzPOE1V1TlDYloNObyEAGnPUiYcD5ACRrdtGU2j5xHT9CsYuNnNV8R8S8pqvzJ3t1q7dm2kp6dn5uDg1CUsMscRKOYAZ0q9AfzOJMSpqc62pwvxRFX9fCLc7SmPcA8YbQx63QG/G8mI7ZnM9N1DQwM0a1akemhoYKFDvJzIOQWgs+Tm4SlzL8Guj7ZOmffWWy2f5NK7ArC3Uf08Ex4LGLjKWOSPQjg3Lmyxn/FpnMtmA4HmZkEtLXK3zI6olnicwJ/3KVvac5FtGnf6pPciE3UNzQs2rG8ZE09FNf2nBFztJoCBK9Km8XMvJXvn2agLODexzwMWA29G+JNjLcuSCfW8EdPi32VQXhhT0IYMFtq2scWPfYqiL0AE9wA4yQ89gC/apvGoLwD26MoCiogNAPwH+DKJCucfatrsP/k0qCgyRdM/CJDLesQ2jS8VpSggU1TTf0PA37uz8Tm22f67AKJliukaEP7VDw8RXZDqSt41mlapj98IpnE/0cP0vGf3R9M2btwok+6+xtFHH109+8B58uukeTIwfmJbxvd9AfCdJvU+BzjPU+g+Au6KCDptQYu11T9PcZSKpsvdsGC8OEYi43bbMr5RnKZgXIqWeAbgz7pykWi2u9oCH1QUVf8JCN/zYZFhm0ZiNJ2/fw7ssE0j8KEuqumfo2yI4zkes03jC54A7NEVjSJCBvmeMWJWGKNnKCIajmjpes/ThEkgUDT9Hf8JZ94JiLNsMznhcMDLdEVNbATxMje6QukbL7ny9+bm5qqdfYMyxSNLZuMPZkcgUjO6RqtougTI51xZgVfTprHUjz25NMO27dhbbnTV0Gmb7Xn52oIA29yk3gf/ux8TRfSa5zs7ghpeLL2iJe4FWKYdgowuMD2ECJ6yO5MytHDNmwURPEKraPpHAA50450aGZqzfv16WT8NPGKafgkDd3gxMosz0lbbf+fSKareDUKdB2+LbRoneskv9HtM019hzxIkvWybyeNy+ccAcPtJ6py/7MEHIH9FaQY9tqjVzNtWi5lAEJ46Lb5cMJlUfHPBh8x4ThD9aRB45gUz+UYQ/YVoV69ePf2TQbHbQ85u2zQ8donxJdTVxQ+LTCX5lXH9MjH4R2mz/bo8AGq67Fo53HV/YtyftgyPGLawBJ818Jds01juCsDepthXGPSA3wVhxhmL2qy8/za/vBOh85Py8C2f+TWQeMxhfnzu7CltLTknbr8yZEmwCjymOyaXn4E30qZxlF+ZhegUH0AC4Xa7Ky/ulXlTebDwiJvpBttM+okzx5imaInNAHs1o3TZprHSC4B3MOgSP06SieS+zLSDlvrIq/mRF5QmWq9/hxg/9doRAsrdAsY9AnxnoXar8WTJlAnB2ZerHIeu4AEhiH2KqqdBWOHOQ7+zzeQ5IzTLGxrmTclUecfnzJfZVvsvgtgjaYdjQJnf8zoYPmWbxqnuAGxU25gw5hQ1jlGbalut/VwXzrdEVeMnO0QyLnIPzoN6VTbKEn7jRDLXpdetk5UT1xFVE2uJeNyi+zDzQ7ZpyORt0UPRdNOrAsTAfWkzW5rLjhXaakVApLyV8lm22f6QN10+RTS6aglVRd704mPG3WnL+Ed3ADap7zEwz0uY/J0YLTVtVlFBqx/5fmnq6upmiimzLyPg2yAc5pfPJ10/mK+yr
fbb3egVNX4ZiG5xo2Hg1rRpjGmq8GlHlkzR9NeAbAOp2/iVbRr/PEIQW5k4nR2WBQX3IThud7YHPkxGtUQzgZ/zEk/Av6VM4xpXAG5uUmUg7be88nRtqyUbHMtiyINA/4A4jwjyv79xMj/NxPTrlJW8eLyJ+snTEejalJn8cbHOGj7o7PTqv2TwdWmz/Uf7AOj39DyUOSKd9t7tR9sfjennkYDMnLgOAr6VMo1fegFQdnj4assmYF1NqyU7j8tuRFetWkKZqi+AWcYcawL8U407FwJdlzKT+xY2l1Cp1+8G43wPR1w4kZJgLKY3skCrt7Odc22zY19ThKLq14HwQw8+/njH+9M3bfJ/7WFEXlTTryZkY3F3ABLOTHUZD3sBUNYB53sJG/59R02rdfDELwz51FYkmdw5BgbEqgxxkwCtYWBVUYBk7BHg4wsdTqKa/jQBJ7uZyMynp632J4qcBmL1+l3M+LoXv2BnmWV1vDJCp2j6fwL4Jw++bbZpHOIlu9DvUU2/mYArPHkzHLft/E/8mHxST6OaJILuKWyEgJ3m2jY7cGnJt/xPgbBm9erphw1F1jD4bDCfA5CvHT9rCvH1dle7bP3PG1FNf4GAvBzXaBrBQrWsNh+HgbGTrlvZdETEyUhQudrK4N602V6bK0HRdNmG5RUqvWCbhleiuuBqKJoua9tf9VyqDC+x7fa3XXfAnqbYzwj0bU9h+wCIltoyOIj4tncUYSzWWMfCafcuI+1jLFgtUDRd1sDdGzcCdJrkmjlc8G8DUO9jnr+0TeNbowAoKz8nuPIyPW1bSS+QFgagqreCsjG32yjY6DBmB+xdo57IjGd9TDSHhC6sbTUnq90pmOpJoPbz+cxRY9qmkQeEYYDIq5PjVihkE+qBs6dWB01yS9kHzJl3LxO+4meqgp06y+p4IZfWz2UkBt2VNpMX+NExmsbnyXyrbRpjbuONcRgDordJ3RQkr0ZAxiG+aNHzqTFtQMVMaD/zyHan10FY4kcvA8m0aeT9tytKfDEi5J4rZHxgW4av9NaIHcP3OGRVyldIRKAnUmby9Nx5LGlunja3b1BmNlzLd4VSJH78IWkUTf+Lawf5XkEbbNMYk0AvaNQ7zeqFjoNf+zVghI4IjzNnrqxtTctc1aQP+bnMkHNAt2XIT+YEX0rYa15Ui/8LgQqebAtNoFAuT1HiqxEhadP4Y28XddSPU449Vp89YxZdDmJZFvNbOx4gJxNNpda9nKtDVeNHOURyQ3EdBLo0ZSZv9aIb/buqnjTHoX7ZCeM1xlRBJENBALJsj39fNeCzEzdP896bWu0UoQcGM9y+ODJ7I7Xkt2EXslQ+5UFVVUcJxjJmoQmmZxe2mY/k0sa0+DUMup6B7SA8LxgdQw5bU2jqBstqCdSHqKzUNcrg+0w4w8tzub87jL/ptoy8pGusXv8yM9wrCMxP21b7uDFWQ0PDvIFMVQMDawlZm/wCL2seE65Odxljrk4oSrwJEe+LUlQgReLHL7FY0zIWmY1etON94sfdlnsTK45hEVkHUOAGxVHGZEB4C8zbAepjUJ+QmxdjOhPmEOhQRrZ6kX8JiJ2zatvsvEWNavFfEWi8ZPAOAt6UbekE9DBjl7z8xMx9LOTlJzFDMB/GgGwGkLlL+QxF0NFhm0Z8NFNMS3yTwXkJ1gKCZYwoXzXYwow+AbAjaA4xzyVgEQN5J9cghhHhwVSXcXahFjO/92fIwepUylgXRK+kVZTGzyLiePZaEujHKTN57Wj5rnHBuyfWr8lknP9htxtiQS32Se8wJRa3mXIX3jcUNf4UiP7Wp4jJJWP0OQKrCj2NEdMS1zM4r8Q0ucpdpT0m+JMzC90DyQJETVwFYs9LZYKx2LKMnqB2K1riawDf68XHoEvTBT7xnh3PbzdqugA/CtpPTzoMz8Rx6KjFSTOvT0/R4hsBcu049nJEMb8zy9e7+MzxksiKFr8ToKJOkMXYM8wj36v6hd01/0q3N1gULXETwPIWntsIfBdkRJjfy06OQ1/uTiX/EGgHHCHOPkA0RdzH7LtLZgJ+HWYdnDqjdtSrUIqm9wWNjSZqCIPfJ4p81e0eh89E70RNyeXfQg5dkkolH/cS6jNJ/KFtGkU1cUTV+M+J6HIvOzJMqzZYyfVFAVAyZdMzjeoFTLiefHbLeBnl8vuO2lYrL/ZU1eZDHBr8cAIyg7IyA/c7A3zVhg3t8hbeuMNfj15Q9YXoeSeYbtnTP+XfN25skf+MniOq6W0Ej40jwAl9tMKoqj9APnKUgxRZ9GJX6+aiATjCuEVVZwzNwtfJwWWgSX2GbJ9tBLxc02rl3R1YUa8fTw5vGHlbz9PzxRLIzy3RwxC4we5MdvsRo6j6+59CG9iIavmP0EnM9++eRnf7uVSfa7Oi6TKM8eqVLJgi8TN3fwCXF6X2TCsUp3rGgG5GvJ2IqVWCznLAJ2VfDPDuiPUzJ5kberCm1RpTW6yvb6odgnMGmE9hIDGJz6DtYlAbiB6JOJE/BEzpkFKfeBKOcyRAiyb2CGbWPfLCkjwtbyDiZKaKn+vu6JC3AIsaMU3/Gdi91u2AO9OWIS+YBx5Kvf5Dcsg1wc5Av20lx7ylKJVNCIC51m5raDhgT/VAlEksdxjHMGOBAObLVIt8jpeG3z/JfZIXwE6GTM/gPRB6BItXWAykalo2+HnFSayo15eRQytI8FL5jAUzauX7d8yYBcIsGS8yQ4AwQIx+JnxE4G0AbQaylYuNEOi2O+e/6OcxRR+rQ9Fo8xxU9x8qBqsOdoR8AljMFMKZmQFPJYciBPlCqnzqdu+TwMz0McPZ6cDZGslUb06nW/wkdX2Y8tdBMmkA/OuYbmhluXkgBGC5rUiF2RMCsMIWvNymGwKw3FakwuwJAVhhC15u0w0BWG4rUmH2hACssAUvt+mGACy3Fakwe0IAVtiCl9t0QwCW24pUmD0hACtswcttuiEAy21FKsyeEIAVtuDlNt0QgOW2IhVmTwjAClvwcptuCMByW5EKsycEYIUteLlNNwRgua1IhdkTArDCFrzcphsCsNxWpMLsCQFYYQtebtMNAVhuK1Jh9oQArLAFL7fphgAstxWpMHtCAFbYgpfbdEMAltuKVJg9/wdg3rqzVZRXGQAAAABJRU5ErkJggg=="
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "5134",
    "Content-Type": "multipart/form-data; boundary=cb3879890fe3ecc2242d77aed4456739",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5eccd9cb-fd61d63c5b7393a0a4820390"
  },
  "json": null,
  "origin": "171.107.139.104",
  "url": "http://httpbin.org/post"
}

In the previous section we saved a CSDN logo file; here we reuse it to simulate a file upload. The output shows that the submitted file is placed under the files field, encoded as base64 (an encoding, not encryption).

2. Cookies

Earlier we handled Cookies with urllib, and the code was quite involved; with requests, getting and setting Cookies takes a single step:

import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

The output is as follows:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

We first request Baidu and read the cookies, which come back as a RequestsCookieJar; the items() method then lets us iterate over each cookie's name and value, so Cookies can be traversed and parsed in one pass. We can also use Cookies to stay logged in; take Zhihu as an example.
First, log in to Zhihu in a browser and copy the Cookie value from the request Headers, then add it to the headers of our own request:

import requests

headers = {
    'Cookie':'_zap=3f3a8847-850e-43e0-be78-b4be0bfa6995; d_c0="AIDX3Ryz2RCPTmMuf_Mm8rKi-lXSeAtT9EU=|1582267483"; _ga=GA1.2.903862747.1583308568; _xsrf=gJaEH2r1y8UDTP1Ej3IozSme9zAcufUL; tst=r; capsion_ticket="2|1:0|10:1588241682|14:capsion_ticket|44:YTY0NmRlOGM2NGE0NDY4OWJhNjRkNDI3ZDM2MjUyZjc=|2845e561cc3deefa38a0f3af569a97c251a4295347df5a19499c63cd06fc96c1"; r_cap_id="MjljZWU0MDJiZWQ4NDEyN2I1YWU2YWY0MzQyMGVmZTk=|1588241690|9a83e301b5a66e0c3a88b97148b2ae53b2f17438"; cap_id="ODEwZWI3NGM0ZDFjNGM2Yjg1MzhmY2YzN2Y5MzJiNDU=|1588241690|abd8b0789de34c3c9deaaff5a955957633d3f9f1"; l_cap_id="NDhlODVkZGI4MGJiNDMzZGJmMjk0NTcwOTczMmVhNmU=|1588241690|31dedb916b6140599e8cd7efc4c69fde323c7f4b"; z_c0=Mi4xbElYOUR3QUFBQUFBZ05mZEhMUFpFQmNBQUFCaEFsVk5KUE9YWHdDOEt2cFJDUW9qMlFnclZMLWNhRUdzX0YwNE5B|1588241700|b205b28d7e4b0a1bf4e8b554ab73574164eb0c73; q_c1=67dc75e659a445e3aab6b904ca4772ed|1588242255000|1588242255000; __utmv=51854390.110--|2=registration_date=20190531=1^3=entry_date=20190531=1; _gid=GA1.2.28121021.1590480973; __utmc=51854390; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1589720541,1589788910,1590480972,1590484318; BAIDU_SSP_lcr=https://www.baidu.com/link?url=C8qsxEG_x3OMJtP83xu1k1mrHxJPpf20TPUUFxPi7BS&wd=&eqid=ea3bb79e00024eb9000000065eccdd58; SESSIONID=ErO66ceM7PuCJd223PseKNutezkIlS8HiDTqimrnydQ; JOID=Vl8TA0i1_WTj2O72K7GZPsICyaY4844t1umOkmCFrhCmkY3HGNMnrbnc7PEuNDz0unQt3Q6KyGjuLwaRnKm_58E=; osd=V10VB0K0_2Ln0u_0LbWTP8AEzaw58Ygp3OiMlGSPrxKglYfGGtUjp7je6vUkNT7yvn4s3wiOwmnsKQKbnau548s=; __utma=51854390.903862747.1583308568.1590480996.1590484903.3; __utmb=51854390.0.10.1590484903; __utmz=51854390.1590484903.3.3.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/jude-78-40; _gat_gtag_UA_149949619_1=1; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1590485091; KLBRSID=0a401b23e8a71b70de2f4b37f5b4e379|1590485093|1590480970',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
response = requests.get('https://www.zhihu.com/people/jude-78-40', headers=headers)
print(response.text)

The output contains content that is only visible after logging in, which shows the login state was carried over successfully.
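
Instead of pasting the whole string into the Cookie header, we can also split it into name/value pairs, load them into a RequestsCookieJar, and pass it via the cookies parameter; a sketch with placeholder values:

import requests

jar = requests.cookies.RequestsCookieJar()
# Placeholder pairs; in practice, split your real Cookie string on '; '
for pair in '_zap=placeholder; _xsrf=placeholder'.split('; '):
    key, value = pair.split('=', 1)
    jar.set(key, value, domain='zhihu.com', path='/')

response = requests.get('https://www.zhihu.com', cookies=jar)
print(response.status_code)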

3. Session maintenance

Calling get() or post() directly does simulate page requests, but each call effectively opens a separate session, as if you were viewing pages in two different browsers. If you log in to a site with post() and then try to fetch your personal page with get(), the second request will not see the login, because it belongs to a different session. The fix is to keep all requests in one session. We could attach the same Cookie to every request by hand, but that is tedious; instead, the requests module provides a Session object to handle it for us. On httpbin we can set a cookie named number with the value germer, then request the cookies endpoint to read it back:

import requests

requests.get('http://httpbin.org/cookies/set/number/germer')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

The output is as follows:

{
  "cookies": {}
}

No cookie came back. Now try the same thing with a Session:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/germer')
response = s.get('http://httpbin.org/cookies')
print(response.text)

The output is as follows:

{
  "cookies": {
    "number": "germer"
  }
}

This time the cookie comes back: a Session maintains state across requests, which makes it the usual tool for the steps that follow a simulated login.
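
A sketch of that typical pattern, logging in once and then reusing the session (the URL and form fields here are hypothetical):

import requests

session = requests.Session()

# Hypothetical login endpoint and form fields
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# The session automatically carries any cookies set by the login response
profile = session.get('https://example.com/profile')
print(profile.status_code)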

4. SSL certificate verification

requests verifies SSL certificates: when you send an HTTPS request, it checks the site's certificate, and if the certificate has expired or is invalid you get a certificate error. The verify parameter controls this check; it defaults to True, so verification happens automatically. To skip it, pass verify=False:

import requests

re = requests.get('https://www.xxxx.cn', verify=False)
print(re.text)
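
Note that with verify=False requests still emits an InsecureRequestWarning on every call; it can be silenced through urllib3, which requests uses under the hood:

import requests
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

re = requests.get('https://www.xxxx.cn', verify=False)
print(re.status_code)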

We can also supply a local certificate to use as the client certificate, either as a single file (containing the key and certificate) or as a tuple of the two file paths (note that the private key must be unencrypted):

import requests

re = requests.get('https://www.xxxx.cn', cert=('/path/server.crt', '/path/key'))
print(re.text)

5. Proxy settings

Some sites respond normally to a few test requests, but once large-scale crawling starts they may show a captcha, redirect you to a login page, or even ban the client IP for a while.
To guard against this, we can route requests through a proxy using the proxies parameter:

import requests

proxy = {
    'http': 'http://120.25.253.234:812',
    'https': 'https://163.125.222.244:8123'
}
re = requests.get('http://httpbin.org/get', proxies=proxy)
print(re.text)

Replace the proxy addresses above with ones that actually work.
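
If the proxy requires authentication, credentials go into the URL in the usual user:password@host:port form; SOCKS proxies are also supported once the optional dependency is installed (pip install "requests[socks]"). A sketch with placeholder addresses:

import requests

proxies = {
    # HTTP proxy with basic auth (placeholder credentials and address)
    'http': 'http://user:password@10.10.1.10:3128',
    # SOCKS5 proxy; requires: pip install "requests[socks]"
    'https': 'socks5://user:password@10.10.1.10:1080',
}
re = requests.get('http://httpbin.org/get', proxies=proxies)
print(re.text)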

6. Timeout settings

When the local network is poor or the server responds slowly or not at all, we may wait a very long time for a response, or never receive one. To avoid that, set a timeout: if no response arrives within the given time, an error is raised. This uses the timeout parameter:

import requests

r = requests.get('https://www.baidu.com', timeout=0.1)
print(r.status_code)

If no response arrives within 0.1 seconds, an exception is raised.
A request actually has two phases, connect and read. The timeout above applies to both; to set them separately, pass a tuple:

r = requests.get('https://www.baidu.com', timeout=(0.1, 0.5))

To wait indefinitely, set timeout to None, or simply omit the parameter, since its default is None.
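
In a crawler you usually want to catch the timeout rather than crash, so wrap the call in a try/except:

import requests

try:
    r = requests.get('https://www.baidu.com', timeout=0.01)
    print(r.status_code)
except requests.exceptions.Timeout:
    # Covers both ConnectTimeout and ReadTimeout
    print('The request timed out')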

7. Authentication

Some sites greet you with an HTTP basic authentication prompt. In that case we can use requests' built-in authentication support:

import requests

r = requests.get('https://localhost:5000', auth=('username', 'password'))
print(r.status_code)

If the username and password are correct, authentication succeeds and a 200 status code is returned.
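
The tuple passed to auth is shorthand for HTTPBasicAuth; the explicit class works too, and requests ships other schemes such as HTTPDigestAuth:

import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

# Equivalent to auth=('username', 'password')
r = requests.get('https://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

# Digest authentication is passed the same way
r = requests.get('https://localhost:5000', auth=HTTPDigestAuth('username', 'password'))
print(r.status_code)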

8. Prepared Requests

When we introduced urllib, we saw that a request can be represented as a data structure whose parts are expressed through a Request object. The same is possible in requests:

from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
}
s = Session()
req = Request('POST', url, headers=headers, data=data)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)

The output is as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5ecd0951-17a8dc447274d384cb1cd7e0"
  },
  "json": null,
  "origin": "171.107.139.104",
  "url": "http://httpbin.org/post"
}

As you can see, we achieve the same POST effect. With the Request object, each request can be treated as an independent object, which is very convenient when scheduling requests in a queue.
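
One practical benefit of preparing a request before sending it is that you can inspect, or tweak, exactly what will go over the wire; method, url, headers, and body are all attributes of the PreparedRequest:

from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'germey'})
prepped = s.prepare_request(req)

# Inspect the exact request before it is sent
print(prepped.method)   # POST
print(prepped.url)      # http://httpbin.org/post
print(prepped.body)     # name=germey

r = s.send(prepped)
print(r.status_code)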