
2020-12-01

程序员文章站 2022-07-08 12:34:16

Scraping Baidu Page Content with Python

Requirement: run a search on Baidu for a query supplied by the user.

```python
import requests

# Read the search term from the user
wd = input('Enter the content you want to search for: ')

# Define the query parameter dictionary
params = {
    "ie": "utf-8",
    "f": "8",
    "rsv_bp": "1",
    "rsv_idx": "1",
    "tn": "baidu",
    "wd": wd,
    "fenlei": "256",
    "oq": "java",
    "rsv_pq": "a10de33200010bdf",
    "rsv_t": "4cc2eJy9DIqlkz3yDhtxC9ELU6Guj7a6USLsNe1imFFQj8wGwFsLu7/fFVk",
    "rqlang": "cn",
    "rsv_enter": "1",
    "rsv_dl": "tb",
    "rsv_btype": "t",
    "inputT": "2305",
    "rsv_sug3": "39",
    "rsv_sug1": "35",
    "rsv_sug7": "101",
    "rsv_sug2": "0",
    "rsv_sug4": "4593",
    "rsv_sug": "1",
}

# Define the request header dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    # The cookie is one long string; adjacent literals are joined by Python
    'Cookie': (
        'BIDUPSID=18ED7937E033F2F01C7C2B2CAC625A62; PSTM=1605764233; '
        'BAIDUID=18ED7937E033F2F0A59CD5F74B955C7D:FG=1; BD_HOME=1; '
        'H_PS_PSSID=32811_1460_33047_33060_31254_33099_33100_32961_32957_31708; '
        'BAIDUID_BFESS=18ED7937E033F2F0A59CD5F74B955C7D:FG=1; BD_UPN=12314353; '
        'delPer=0; BD_CK_SAM=1; PSINO=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; '
        'COOKIE_SESSION=20_0_2_9_0_10_0_0_0_3_0_2_0_0_0_0_0_0_1605769247%7C9%230_0_1605769247%7C1; '
        'H_PS_645EC=60a31GVLMW6tXduYVX2cmufB%2BbcJWNkuhMjFZ%2Fck4AI5WlyZ0YOszTvbDhc; '
        'BA_HECTOR=ah0k0120al242gaktd1frc6120q'
    ),
}
```

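Notes:

1. Whatever crawler you write from now on, always send request headers, with at least a User-Agent (UA).

2. If the data is still unreachable after adding the UA, try adding the Cookie.

3. If adding the Cookie still does not help, add all of the request headers, except those whose names start with `:` (these are HTTP/2 pseudo-headers such as `:authority` and `:path`, not ordinary header fields).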
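The three notes above can be sketched as a small helper. This is only an illustration, not the original author's code: the function names `clean_browser_headers` and `escalation_steps` are hypothetical, and the sample header text is made up to resemble what browser developer tools show. It assumes headers are pasted one per line in `name: value` form.

```python
def clean_browser_headers(raw):
    """Parse 'name: value' lines copied from the browser, skipping names that start with ':'."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Note 3: pseudo-headers such as ':authority' must not be sent
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a 'name: value' line
        headers[name.strip()] = value.strip()
    return headers


def escalation_steps(all_headers):
    """Return header sets to try in order: UA only, UA + Cookie, then all headers."""
    by_lower = {name.lower(): (name, value) for name, value in all_headers.items()}
    steps, picked = [], {}
    for wanted in ("user-agent", "cookie"):
        if wanted in by_lower:
            name, value = by_lower[wanted]
            picked[name] = value
            steps.append(dict(picked))
    steps.append(dict(all_headers))
    return steps


# Made-up sample, for illustration only
SAMPLE = """
:authority: www.baidu.com
:method: GET
accept: text/html
cookie: BD_HOME=1; BAIDUID=EXAMPLE:FG=1
user-agent: Mozilla/5.0 (example)
"""

cleaned = clean_browser_headers(SAMPLE)
for attempt in escalation_steps(cleaned):
    print(sorted(attempt))
```

Each dictionary returned by `escalation_steps` could be passed as `headers=` to `requests.get` in turn, stopping at the first attempt that returns the expected page.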

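```python
# Send the request.
# Assumption: the original omits this step; https://www.baidu.com/s is Baidu's search endpoint.
url = 'https://www.baidu.com/s'
response = requests.get(url, params=params, headers=headers)

# Save the response to a file
with open(f'{wd}.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
print(response.text)

# Problem observed: the returned content differs greatly from the page source shown in the browser.
# To diagnose, inspect the default request headers and the final URL:
# print(response.request.headers)
# print(response.url)
```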
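The header check suggested in the comments above can also be tried without sending anything. The sketch below uses a shortened, hypothetical parameter set and assumes `https://www.baidu.com/s` as the search endpoint. It prepares a request with `requests` and prints the headers the library would send by default, which is a likely reason a site returns a different page to a script than to a browser.

```python
import requests

# Headers that requests sends when the caller supplies none
defaults = requests.utils.default_headers()
print(defaults["User-Agent"])  # identifies the client as python-requests

# Build the request without sending it, to see the final URL and headers
session = requests.Session()
request = requests.Request(
    "GET",
    "https://www.baidu.com/s",              # assumed endpoint
    params={"wd": "python", "ie": "utf-8"},  # shortened example parameters
)
prepared = session.prepare_request(request)
print(prepared.url)
print(dict(prepared.headers))
```

If the printed User-Agent is the library default rather than a browser string, passing a `headers` dictionary like the one defined earlier is the first thing to try.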







Original article: https://blog.csdn.net/weixin_50702242/article/details/110456888
