Python Crawler 04: A Tieba Spider, and the Difference Between GET and POST
程序员文章站
2022-05-13 17:41:54
Table of Contents
1. Components of a URL
Chinese characters are URL-encoded (UTF-8) into percent-encoded bytes.
If you copy and paste such an address, what comes out is not the characters but their encoded bytes:
https://www.baidu.com/s?wd=%E7%BC%96%E7%A8%8B%E5%90%A7
We can also do this conversion in Python with urllib.parse.urlencode:

```python
import urllib.parse

url = "http://www.baidu.com/s?"
wd = {"wd": "编程吧"}
out = urllib.parse.urlencode(wd)
print(out)
```
The output is: wd=%E7%BC%96%E7%A8%8B%E5%90%A7
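The conversion also works in reverse. As a quick sanity check (a small sketch using only the standard library), urllib.parse.unquote decodes the percent-encoded bytes back into characters:

```python
import urllib.parse

# Encode a Chinese query string, then decode it back
encoded = urllib.parse.urlencode({"wd": "编程吧"})
print(encoded)  # wd=%E7%BC%96%E7%A8%8B%E5%90%A7

# unquote reverses the percent-encoding
decoded = urllib.parse.unquote(encoded)
print(decoded)  # wd=编程吧
```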
2. Tieba Spider
2.1. Crawling only the first page
```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com/s?"
keyword = input("please input query: ")
wd = {"wd": keyword}
wd = urllib.parse.urlencode(wd)
fullurl = url + wd  # url already ends with "?"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
request = urllib.request.Request(fullurl, headers=headers)
response = urllib.request.urlopen(request)
html = response.read()
print(html)
```
2.2. Crawling every page of a forum
For a given forum (here, the 编程 forum), we can page through the results and spot the pattern in the URLs:
```
page 1: http://tieba.baidu.com/f?kw=%E7%BC%96%E7%A8%8B&ie=utf-8&pn=0
page 2: http://tieba.baidu.com/f?kw=%E7%BC%96%E7%A8%8B&ie=utf-8&pn=50
page 3: http://tieba.baidu.com/f?kw=%E7%BC%96%E7%A8%8B&ie=utf-8&pn=100
```
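The pattern boils down to one formula: the pn parameter advances in steps of 50, i.e. pn = (page - 1) * 50. A minimal sketch that regenerates the three URLs above:

```python
# pn advances by 50 per page: page 1 -> 0, page 2 -> 50, page 3 -> 100
base = "http://tieba.baidu.com/f?kw=%E7%BC%96%E7%A8%8B&ie=utf-8"
for page in range(1, 4):
    pn = (page - 1) * 50
    print(base + "&pn=" + str(pn))
```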
```python
import urllib.request
import urllib.parse


def loadpage(url, filename):
    """
    Send a request to the given url.
    url: the address to request
    filename: name of the file being processed
    """
    print("Downloading", filename)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    html = response.read()
    return html


def writepage(html, filename):
    """
    Write the html content to a local file.
    html: response body returned by the server
    """
    print("Saving", filename)
    with open(filename, "wb") as f:
        f.write(html)
    print("-" * 30)


def tiebaspider(url, beginpage, endpage):
    """
    Spider scheduler: build and process the url for each page.
    """
    for page in range(beginpage, endpage + 1):
        pn = (page - 1) * 50
        filename = "page" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        html = loadpage(fullurl, filename)
        writepage(html, filename)


if __name__ == "__main__":
    kw = input("please input query: ")
    beginpage = int(input("start page: "))
    endpage = int(input("end page: "))
    url = "http://tieba.baidu.com/f?"
    key = urllib.parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaspider(fullurl, beginpage, endpage)
```
The output is:

```
please input query: 编程吧
start page: 1
end page: 5
Downloading page1.html
Saving page1.html
------------------------------
Downloading page2.html
Saving page2.html
------------------------------
Downloading page3.html
Saving page3.html
------------------------------
Downloading page4.html
Saving page4.html
------------------------------
Downloading page5.html
Saving page5.html
------------------------------
```
3. The difference between GET and POST
- GET: the query parameters are appended to the request URL
- POST: the request URL carries no parameters; the data travels in the request body
3.1. GET requests
For a GET request, the parameters are kept in the query string.
3.2. POST requests
For a POST request, the query parameters are carried in the web form (the request body).
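This distinction is visible in urllib itself: a Request built without a data argument is sent as a GET, while the same Request with a body becomes a POST. A small offline sketch (the URL is a placeholder; nothing is actually sent):

```python
import urllib.parse
import urllib.request

url = "http://example.com/anything"  # placeholder url, no request is made

# No body: urllib will issue a GET
get_req = urllib.request.Request(url)
print(get_req.get_method())  # GET

# With a body: urllib switches to POST
data = urllib.parse.urlencode({"q": "hello"}).encode("utf-8")
post_req = urllib.request.Request(url, data=data)
print(post_req.get_method())  # POST
```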
3.3. Simulating a POST request to Youdao Translate
- First, capture the request with a packet-capture tool:
```
POST http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null HTTP/1.1
Host: fanyi.youdao.com
Connection: keep-alive
Content-Length: 254
Accept: application/json, text/javascript, */*; q=0.01
Origin: http://fanyi.youdao.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: http://fanyi.youdao.com/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,en-CA;q=0.6
Cookie: OUTFOX_SEARCH_USER_ID=-1071824454@10.169.0.83; OUTFOX_SEARCH_USER_ID_NCOO=848207426.083082; JSESSIONID=aaaiykbb5lz2t6ro6rcgw; ___rl__test__cookies=1546662813170
X-HD-Token: rent-your-own-vps

# The line below is the form data -- this is the important part
i=love&from=AUTO&to=AUTO&smartresult=dict&client=fanyideskweb&salt=15466628131726&sign=63253c84e50c70b0125b869fd5e2936d&ts=1546662813172&bv=363eb5a1de8cfbadd0cd78bd6bd43bee&doctype=json&version=2.1&keyfrom=fanyi.web&action=FY_BY_REALTIME&typoResult=false
```
- Extract the key form fields:

```
i=love
doctype=json
version=2.1
keyfrom=fanyi.web
action=FY_BY_REALTIME
typoResult=false
```
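These are the fields we will re-encode in the script below. As a quick check (a sketch on a subset of the fields), urllib.parse.urlencode turns such a dict back into the key=value&... body format seen in the capture:

```python
import urllib.parse

# A subset of the captured form fields
formdata = {"i": "love", "doctype": "json", "version": "2.1"}
body = urllib.parse.urlencode(formdata)
print(body)  # i=love&doctype=json&version=2.1
```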
- Simulating Youdao Translate:

```python
import urllib.request
import urllib.parse

# Obtained by packet capture -- this is not the url shown in the browser
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

# Full headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}

# User input
key = input("please input english: ")

# The form data Youdao expects.
# This is what POST submits to the web server: POST sends data and the server
# builds its response from it, whereas GET sends no body.
formdata = {
    "i": key,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false",
}

# Encode the form data into bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# With data and headers we can build a POST request: if the data argument
# is present the request is a POST, otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
html = response.read()
print(html)
```
The output is:

```
please input english: hello
b'{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"hello","tgt":"\xe4\xbd\xa0\xe5\xa5\xbd"}]]}\n'
```
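The response is raw JSON bytes, with the translation UTF-8 encoded. To pull out the translated text, decode and parse it. The sketch below operates on the sample response bytes shown above (a pasted literal, not a live request):

```python
import json

# Sample bytes as returned by response.read() above
html = b'{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"hello","tgt":"\xe4\xbd\xa0\xe5\xa5\xbd"}]]}\n'

# Decode the UTF-8 bytes and parse the JSON
result = json.loads(html.decode("utf-8"))

# translateResult is a list of lists of {src, tgt} pairs
translation = result["translateResult"][0][0]["tgt"]
print(translation)  # 你好
```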