[Python Crawler] Automatically Translating Text with Python
程序员文章站
2024-02-03 20:19:22
First, open the Google Translate website.
Then translate a few words and watch how the URL changes.
Translation mode | Text to translate | Resulting URL |
---|---|---|
Auto-detect -> Chinese | hello | https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello |
Auto-detect -> English | 你好 | https://translate.google.cn/#view=home&op=translate&sl=auto&tl=en&text=%E4%BD%A0%E5%A5%BD |
English -> Chinese | hello | https://translate.google.cn/#view=home&op=translate&sl=en&tl=zh-CN&text=hello |
Chinese -> English | 你好 | https://translate.google.cn/#view=home&op=translate&sl=zh-CN&tl=en&text=%E4%BD%A0%E5%A5%BD |
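Comparing these URLs shows that sl is followed by the source language, tl by the target language, and text by the content to translate. The string %E4%BD%A0%E5%A5%BD is the percent-encoded (URL-encoded) form of the UTF-8 bytes of “你好”. So I tried replacing that string with the literal “你好” and requested the site again.
It turns out that using the Chinese characters directly also returns the correct content.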
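The relationship between “你好” and %E4%BD%A0%E5%A5%BD can be checked with the standard library. This is a minimal sketch: `build_page_url` is a hypothetical helper introduced here for illustration, and it only reproduces the URL pattern seen in the table above.

```python
from urllib.parse import quote, unquote

# Percent-encode the UTF-8 bytes of the text, and decode them back.
encoded = quote("你好")
decoded = unquote("%E4%BD%A0%E5%A5%BD")
print(encoded)
print(decoded)


def build_page_url(text, sl="auto", tl="zh-CN"):
    """Hypothetical helper: rebuild the page URL pattern observed in the table."""
    return (
        "https://translate.google.cn/#view=home&op=translate"
        f"&sl={sl}&tl={tl}&text={quote(text)}"
    )


print(build_page_url("你好", tl="en"))
```

Browsers do this encoding automatically, which would explain why typing the raw Chinese characters into the address bar works as well.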
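So, let's try to fetch the page with a Python crawler and extract the translated text from it.
First, use Auto-detect -> Chinese to translate hello. The URL is https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello . Open it, open the developer tools, press Ctrl+Shift+C, and click the “你好” text on the page. Then, in the developer tools, right-click the element highlighted in blue and choose Copy->Copy Selector.
Then start writing code: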
from requests_html import HTMLSession
session = HTMLSession()
link = 'https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello'
r = session.get(link)
# CSS selector copied from the developer tools (Copy->Copy Selector)
f = r.html.find('body > div.container > div.frame > div.page.tlid-homepage.homepage.translate-text > div.homepage-content-wrap > div.tlid-source-target.main-header > div.source-target-row > div.tlid-results-container.results-container > div.tlid-result.result-dict-wrapper > div.result.tlid-copy-target > div.text-wrap.tlid-copy-target > div > span.tlid-translation.translation > span', first=True)
# first=True returns a single element, or None when nothing matches
print(f)
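The code runs without errors, but the return value f is None. This suggests that the translated content is loaded asynchronously. Open the Network tab in the developer tools, translate “hello” again, and watch the requests: the result is indeed loaded asynchronously.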
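One likely contributing factor is worth spelling out: everything after the # in a URL is the fragment, which an HTTP client does not send to the server. The sketch below only splits the URL locally to show where the translation parameters actually sit; it makes no network request.

```python
from urllib.parse import urlsplit, parse_qs

link = 'https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello'
parts = urlsplit(link)

# Path and query are what the server sees; the fragment stays in the client.
print(repr(parts.path))
print(repr(parts.query))

# The translation parameters all live in the fragment.
fragment_params = parse_qs(parts.fragment)
print(fragment_params)
```

Since the query string is empty, the server receives a plain request for the home page, and the page's own JavaScript reads the fragment and fetches the translation afterwards.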
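Clicking through the Preview tab of each request shows that the one named single?client…… contains the translation result.
So right-click that request and copy its link:
Then start over with new code: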
from requests_html import HTMLSession
session = HTMLSession()
link = 'https://translate.google.cn/translate_a/single?client=webapp&sl=auto&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=sos&dt=ss&dt=t&otf=1&ssel=0&tsel=0&xid=45626150&kc=13&tk=680344.843218&q=hello'
r = session.get(link)
print(r.text)
The output now contains the translation result:
[[["你好","hello",null,null,1]
,[null,null,"Nǐ hǎo","heˈlō,həˈlō"]
]
,[["感叹词",["你好!","喂!"]
,[["你好!",["Hello!","Hi!","Hallo!"]
,null,0.13117145]
,["喂!",["Hey!","Hello!"]
,null,0.020115795]
]
,"Hello!",9]
]
,"en",null,null,[["hello",null,[["你好",1000,true,false]
,["您好",1000,true,false]
]
,[[0,5]
]
,"hello",0,0]
]
,1.0,[]
,[["en"]
,null,[1.0]
,["en"]
]
,null,null,[["名词",[[["hullo","hi","how-do-you-do","howdy"]
,""]
]
,"hello"]
,["惊叹词",[[["hi","howdy","hey","hiya","ciao","aloha"]
,"m_en_us1254307.001"]
]
,"hello"]
]
,[["名词",[["an utterance of “hello”; a greeting.","m_en_us1254307.006","Colin Spencer still stood by the desk no one signed in at; and he still smiled and nodded his hellos and goodbyes to every oblivious face that passed him by as though he was host to this year's biggest A-list birthday bash."]
]
,"hello"]
,["惊叹词",[["used as a greeting or to begin a telephone conversation.","m_en_us1254307.001","But instead of a normal greeting like saying hello or something, they hugged."]
]
,"hello"]
,["动词",[["say or shout “hello”; greet someone.","m_en_us1254307.007","After all the helloing and such, he would sit down and talk to me in a gruff, military kind of way."]
]
,"hello"]
]
……
Now try to extract the translated text:
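# content holds the response text (r.text) from the request above.
# It starts with [[[" so the translation begins at index 4.
for i in range(5, 100):
    # Stop at the closing quote (assumes it falls within the first 100 characters)
    if content[i] == '"':
        count = i
        break
print(content[4:count])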
The translated text is extracted successfully.
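Because the response printed above looks like a JSON array, parsing it may be sturdier than counting characters. The following is a sketch under that assumption: `extract_translation` is a hypothetical helper, and `sample` is only the first top-level element of the response shown above, not the full payload. Joining every segment is also an assumption about how longer inputs are returned.

```python
import json

# Abbreviated sample: only the first top-level element of the response above.
sample = '[[["你好","hello",null,null,1],[null,null,"Nǐ hǎo","heˈlō,həˈlō"]]]'


def extract_translation(raw):
    """Hypothetical helper: join the translated text of each segment in data[0]."""
    data = json.loads(raw)
    segments = data[0]
    # Rows such as the pronunciation row have None in the first position.
    return "".join(seg[0] for seg in segments if seg[0] is not None)


translation = extract_translation(sample)
print(translation)
```

With a live response, the same function could be called on r.text, provided the endpoint still returns this structure.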
The complete code is as follows:
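from requests_html import HTMLSession
session = HTMLSession()
# Link of the asynchronous request, copied from the Network tab
link = 'https://translate.google.cn/translate_a/single?client=webapp&sl=auto&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=sos&dt=ss&dt=t&otf=1&ssel=0&tsel=0&xid=45626150&kc=13&tk=680344.843218&q=hello'
r = session.get(link)
content = r.text
# The response starts with [[[" so the translation begins at index 4.
for i in range(5, 100):
    # Stop at the closing quote (assumes it falls within the first 100 characters)
    if content[i] == '"':
        count = i
        break
print(content[4:count])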
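The request link can also be assembled from its parts, which makes the roles of sl, tl and q easier to see. This is only a sketch: `build_single_url` is a hypothetical helper, and every parameter value, including tk, is copied verbatim from the captured link. The article does not establish whether that tk value remains valid for other inputs, so treat the helper as a way to reproduce the captured URL rather than a general client.

```python
from urllib.parse import urlencode

BASE = "https://translate.google.cn/translate_a/single"
# Repeated dt values, in the order they appear in the captured link.
DT_VALUES = ("at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "sos", "ss", "t")


def build_single_url(text, sl="auto", tl="zh-CN"):
    """Hypothetical helper: rebuild the captured request URL from its parameters."""
    params = [("client", "webapp"), ("sl", sl), ("tl", tl), ("hl", "zh-CN")]
    params += [("dt", value) for value in DT_VALUES]
    params += [
        ("otf", "1"), ("ssel", "0"), ("tsel", "0"),
        ("xid", "45626150"), ("kc", "13"),
        ("tk", "680344.843218"),  # copied from the browser; validity for other text is unverified
        ("q", text),
    ]
    # A list of pairs keeps the order and allows the repeated dt key.
    return BASE + "?" + urlencode(params)


url = build_single_url("hello")
print(url)
```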