Web scraping: basic requests usage, GET and POST
GET requests

The general form:

response = requests.get(url=url, params=params, headers=headers)

Without parameters:

Other important attributes of the response object:
import requests

url = 'https://www.sogou.com/'
response = requests.get(url=url)

# Page data as bytes
print(response.content)

# Response status code
print(response.status_code)
# 200

# Response headers, as a dict-like object
print(response.headers)
# {'Server': 'nginx', 'Date': 'Tue, 19 Mar 2019 07:31:40 GMT', 'Content-Type': 'text/html; charset=UTF-8',
# 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding',
# 'Set-Cookie': 'ABTEST=0|1552980700|v17;

# The URL that was requested
print(response.url)
With parameters

You can append the query string to the URL directly, but the cleaner way is to pack the parameters into a dict:

# Method 2: pass the parameters as a dict
import requests

url = 'https://www.sogou.com/web'
# Pack the parameters into a dict
params = {'query': '周杰伦', 'ie': 'utf8'}
response = requests.get(url=url, params=params)
print(response.status_code)
# print(response.content)
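To see exactly what requests does with the params dict without sending anything over the network, you can build a PreparedRequest. A small offline sketch using the same URL and parameters as above:

```python
import requests

# Same URL and parameters as in the example above
url = 'https://www.sogou.com/web'
params = {'query': '周杰伦', 'ie': 'utf8'}

# Build the request without sending it
prepared = requests.Request('GET', url, params=params).prepare()

# Non-ASCII parameter values are percent-encoded as UTF-8
print(prepared.url)
# https://www.sogou.com/web?query=%E5%91%A8%E6%9D%B0%E4%BC%A6&ie=utf8
```

This is also a handy way to debug a scraper: inspect the prepared URL before deciding to send it.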
Custom request headers
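Many sites block requests whose default User-Agent identifies them as a script, so a browser-style User-Agent is usually passed in a headers dict. A minimal sketch, again prepared offline rather than sent (the User-Agent string is just an example):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

# The headers dict is passed through to the outgoing request unchanged
prepared = requests.Request('GET', 'https://www.sogou.com/', headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

In a real request you would simply pass the same dict: requests.get(url, headers=headers).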
POST requests

Simulating a login to Douban to fetch the post-login page.

How do you find the right URL?

Open the browser's developer tools and watch the network traffic. Submitting a username/password form normally produces a POST request, so look for the POST whose form data carries the account and password parameters; that request's URL is the one you want. Then copy the form-data parameters into a dict.
# POST request: log in to Douban and fetch the post-login page
import requests

# Target URL
url = 'https://accounts.douban.com/j/mobile/login/basic'

# Form data copied from the captured request (fill in your own credentials)
data = {
    'ck': '',
    'name': '',
    'password': '',
    'remember': 'false',
    'ticket': ''
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

# Send the request: requests.post(url, data=None, json=None, **kwargs)
response = requests.post(url=url, data=data, headers=headers)

# After the request succeeds, get the page data
page_text = response.text
print(page_text)

# Persist the result to a file
with open('douban.html', 'w', encoding='utf-8') as f:
    f.write(page_text)
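A successful login normally arrives with Set-Cookie headers, and every later request must send those cookies back. requests.Session does this bookkeeping automatically. A minimal offline sketch, with a made-up cookie name standing in for a real login response:

```python
import requests

# A Session persists cookies across requests
session = requests.Session()

# In a real flow, session.post(url, data=data) would store the cookies
# from the login response; here a fake one is set manually to illustrate
session.cookies.set('sessionid', 'example-value')

# Any request prepared (or sent) on this session carries the stored cookie
prepared = session.prepare_request(requests.Request('GET', 'https://www.douban.com/'))
print(prepared.headers.get('Cookie'))
```

So for login-then-scrape flows, replacing requests.post/requests.get with a shared Session is usually all that is needed.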
AJAX requests

GET requests

An AJAX GET is still an ordinary requests.get call:
import requests

# Full URL as captured from the browser:
# url='https://movie.douban.com/j/chart/top_list?type=13&interval_id=100%3A90&action=&start=60&limit=20'
url = 'https://movie.douban.com/j/chart/top_list?'
params = {
    "type": "13",
    "interval_id": "100:90",
    "action": "",
    "start": "80",
    "limit": "20",
}
# Pay particular attention to start and limit: they control pagination
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
response = requests.get(url=url, params=params, headers=headers)

# The response body is JSON
print(response.text)
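Since the body is JSON, response.json() (or json.loads(response.text)) parses it straight into Python objects. Sketched below on a made-up payload, because the real field names should be checked against the actual API output:

```python
import json

# Illustrative stand-in for response.text; the real fields may differ
sample_text = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

# response.json() would do this same parsing on a live response
movies = json.loads(sample_text)
for movie in movies:
    print(movie['title'], movie['score'])
# Movie A 9.1
# Movie B 8.7
```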
POST requests

Same as an ordinary POST request: use the developer tools to find the URL and its parameters, then pass the parameters via data.
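For a form-style POST, requests URL-encodes the data dict into the request body and sets the Content-Type header for you. Another offline sketch, with a placeholder URL and fields:

```python
import requests

# Placeholder URL and form fields, not a real login endpoint
data = {'name': 'user', 'password': 'secret'}
prepared = requests.Request('POST', 'https://example.com/login', data=data).prepare()

print(prepared.headers['Content-Type'])
# application/x-www-form-urlencoded
print(prepared.body)
# name=user&password=secret
```

If the site expects a JSON body instead (common for AJAX endpoints), pass json=data rather than data=data and requests will set Content-Type to application/json.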
Summary

These two calls cover most jobs; the real work is using the developer tools to find the right URL and parameters:

response = requests.get(url=url, params=params, headers=headers)
response = requests.post(url=url, data=data, headers=headers)