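Implementing a Baidu Translate (Mobile Version) Crawler in Python, with a Detailed Look at How sign Is Constructed
This post is a set of study notes.
Scraping Baidu Translate did not go smoothly: I was stuck on the sign parameter for a long time. Below is my analysis of that parameter and how I solved it. [Source code at the end]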
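Analysis:
Translating different phrases shows that the sign value changes with the input: sign value for “我爱我的祖国”
sign value for “爱我中华”
sign value for “海明威”
(1) Guess that sign is generated by JavaScript: search globally for "sign".
As the figure above shows, y(a) is the JS code we are looking for. Set a breakpoint and step into this function.
So we only need to execute this JS function from Python to get the sign value.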
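Copy this JS code into PyCharm, format it, and test it:
Put the extracted code into 测试.js, as shown in the figure below.
Create Test.py to test this JS code, as shown in the figure below.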
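A test script along these lines might look like the following sketch. It assumes PyExecJS (imported as execjs, the same library the full script at the end uses) and a JavaScript runtime such as Node.js are installed. The helper name call_js is hypothetical and not part of the original Test.py.

```python
def call_js(js_path, func_name, *args):
    """Compile the JS file at js_path and call one of its top-level functions."""
    # Imported inside the function so this module still loads
    # on machines where PyExecJS is not installed.
    import execjs

    with open(js_path, "r", encoding="utf-8") as f:
        source = f.read()
    # execjs.compile() returns a context; call() invokes a function by name.
    return execjs.compile(source).call(func_name, *args)


# Example call (assumes the extracted file exposes a function named "e",
# as the BaiDuFanyi.js listing at the end of this post does):
#     print(call_js("测试.js", "e", "hello"))
```

Any ReferenceError raised inside the JS surfaces in Python as an exception from call(), which is how the missing definitions described next show up.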
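Unfortunately, this raised an error:
The error message says that the variable i is not defined in the JS, so let's go back to the JS and look:
Sure enough, the definition of i is missing. So what is i? Setting more breakpoints to inspect it reveals, pleasantly, that its value does not change, as shown in the figure below: for “我爱我的祖国” the value of i is "320305.131321201".
Trying another phrase, “人生路坎坷”, the value of i is again "320305.131321201".
After N more attempts the value was the same every time, so treat i as a constant and define it as "320305.131321201". Testing the JS code again still fails, this time with a missing-object error (as shown in the figure below). The message does not say which object is missing, so we have to find it ourselves.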
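Re-reading the JS code shows that two functions, a and n, are not defined (as shown in the figure below).
So we go looking for these two functions (as shown in the figures: first the n function, then the a function).
Copy the a and n functions into 测试.js, which now looks as shown in the figure below: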
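Run **Test.py** again and it does return the correct result, as shown in the figure below:
That completes the JS analysis: we now have the value of **sign**.
The next step is to send the POST request, fetch the data, and process it.
The complete code follows: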
Fanyi_Sprider02.py
import requests
import json
import execjs


class Fanyi_Sprider:
    def __init__(self, language_contant):
        # Keep the input text on the instance so run() does not depend on a global
        self.language_contant = language_contant
        self.url_detact = "https://fanyi.baidu.com/langdetect"
        self.url_res = "https://fanyi.baidu.com/basetrans"
        self.header = {
            "cookie": "BIDUPSID=4605EF59F8ADD4DDC07A417C10B9F3C0; PSTM=1578983328; BAIDUID=4109A7FBACC39A30E12409E85FF9E29B:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; APPGUIDE_8_2_2=1; DOUBLE_LANG_SWITCH=0; from_lang_often=%5B%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; delPer=0; H_PS_PSSID=1429_21095_18559_26350_30498; PSINO=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1579266775,1579314208,1579402140; to_lang_often=%5B%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1579402168; Hm_lvt_afd111fa62852d1f37001d1f980b6800=1579314657,1579402168; Hm_lpvt_afd111fa62852d1f37001d1f980b6800=1579402168; __yjsv5_shitong=1.0_7_6397b4f578d5cdd3e38a743c63141afd2cc0_300_1579409241580_112.8.215.146_666f9998; yjs_js_security_passport=54bd2e66dc50db2bd3cfb49cf968b3b4bc353348_1579409242_js",
            "user-agent": "Mozilla/5.0 (Linux; Android 8.0.0; Nexus 6P Build/OPP3.170518.006) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Mobile Safari/537.36"}
        self.detact_data = {
            "query": language_contant
        }
        # Load the extracted JS and compute sign by calling its e() function
        with open("BaiDuFanyi.js", 'r', encoding="utf-8") as f:
            resp = f.read()
        self.sign = execjs.compile(resp).call('e', language_contant)

    # Send a POST request and return the decoded response body
    def _send_post_(self, urls, data):
        response = requests.post(urls, data=data, headers=self.header).content.decode("utf-8")
        return response

    def detact(self):
        # Run language detection and return the detected language code
        language_kind = json.loads(self._send_post_(self.url_detact, self.detact_data))["lan"]
        return language_kind

    def res(self, res_data):
        p = self._send_post_(self.url_res, res_data)
        print(p)
        res_colletion = json.loads(p)["trans"][0]["dst"]
        return res_colletion

    def run(self):  # Main logic
        '''
        1. Call detact(): send a POST request and get back the detected language
        2. Call res(): send a POST request and get back the translation
        '''
        language_kind = self.detact()
        res_data = {
            "query": self.language_contant,
            "from": "en" if language_kind == "en" else "zh",
            "to": "zh" if language_kind == "en" else "en",
            "token": "3018ae176904de63297751917421d1f7",
            "sign": self.sign
        }
        res = self.res(res_data)
        print(res)


if __name__ == '__main__':
    language_contant = input("Enter the text to translate:\n")
    fanyi_sprider = Fanyi_Sprider(language_contant)
    fanyi_sprider.run()
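The payload construction in run() and the response parsing in res() can also be written as small pure functions, which makes them easy to check without touching the network. This is only a sketch: the names build_payload and extract_translation are hypothetical, and the sample response strings in the usage note are made up. Only the trans[0].dst path and the form field names come from the script above.

```python
import json


def build_payload(query, detected_lang, token, sign):
    """Build the form fields for the basetrans POST, mirroring run()."""
    is_english = detected_lang == "en"
    return {
        "query": query,
        # Same rule as the script: English goes to Chinese,
        # anything else is treated as Chinese and goes to English.
        "from": "en" if is_english else "zh",
        "to": "zh" if is_english else "en",
        "token": token,
        "sign": sign,
    }


def extract_translation(raw):
    """Return trans[0]["dst"] from a response body, or None if the shape differs."""
    try:
        return json.loads(raw)["trans"][0]["dst"]
    except (ValueError, KeyError, IndexError, TypeError):
        # Invalid JSON, or a body without the expected structure
        # (for example an error reply when sign or token is rejected).
        return None
```

With a made-up body such as '{"trans": [{"dst": "你好"}]}', extract_translation returns "你好". For a body that is not JSON or lacks the trans key, it returns None instead of raising, so the caller can decide how to report the failure.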
BaiDuFanyi.js
var i = "320305.131321201"
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}
function a(r) {
    if (Array.isArray(r)) {
        for (var o = 0, t = Array(r.length); o < r.length; o++)
            t[o] = r[o];
        return t
    }
    return Array.from(r)
}
function e(r) {
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
            "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
            C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0
      , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
    u = null !== i ? i : (i = window[l] || "") || "";
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
        S[c++] = A >> 18 | 240,
        S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
        S[c++] = A >> 6 & 63 | 128),
        S[c++] = 63 & A | 128)
    }
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
        p += S[b],
        p = n(p, F);
    return p = n(p, D),
    p ^= s,
    0 > p && (p = (2147483647 & p) + 2147483648),
    p %= 1e6,
    p.toString() + "." + (p ^ m)
}
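To make the listing above easier to follow, here is a rough pure-Python port of e() and n(). It is a reading aid rather than a replacement for the execjs approach used in this post. The String.fromCharCode chains decode to "gtk", "+-a^+6" and "+-3^+b+-f", and the manual byte loop is ordinary UTF-8 encoding. The names sign, _mix and _to_int32 are my own, and the port assumes well-formed Unicode input (no lone surrogates) and a gtk value of the form "m.s".

```python
GTK = "320305.131321201"  # the constant i from BaiDuFanyi.js


def _to_int32(x):
    """Wrap a Python int to a signed 32-bit value, as JS bitwise operators do."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def _mix(r, ops):
    """Port of n(r, o): every 3-character group in ops encodes one mixing step."""
    for t in range(0, len(ops) - 2, 3):
        c = ops[t + 2]
        shift = ord(c) - 87 if c >= "a" else int(c)  # "a" -> 10, "b" -> 11, ...
        if ops[t + 1] == "+":
            a = (r & 0xFFFFFFFF) >> shift             # JS: r >>> a
        else:
            a = _to_int32((r & 0xFFFFFFFF) << shift)  # JS: r << a
        # JS: r + a & 4294967295  or  r ^ a, both yielding a signed 32-bit value
        r = _to_int32(r + a) if ops[t] == "+" else _to_int32(r ^ a)
    return r


def sign(text, gtk=GTK):
    """Port of e(r): compute the sign string for text."""
    chars = list(text)  # code points, so surrogate pairs count as one character
    count = len(chars)
    if count > 30:
        # Long inputs are sampled: first 10, middle 10 and last 10 characters.
        mid = count // 2
        text = "".join(chars[:10] + chars[mid - 5:mid + 5] + chars[-10:])
    m, s = (int(part) for part in gtk.split("."))
    p = m
    for byte in text.encode("utf-8"):
        p = _mix(p + byte, "+-a^+6")
    p = _mix(p, "+-3^+b+-f")
    p = (p ^ s) & 0xFFFFFFFF  # same as the JS fix-up for negative values
    p %= 1_000_000
    return f"{p}.{p ^ m}"
```

For example, sign("hello") returns "54706.276099" with the gtk value above. Because the second half is just the first half XOR-ed with the integer part of gtk, the two numbers in a sign always determine each other.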