
[Python Web Scraping] Automatically Translating Text with Python

程序员文章站 2024-02-03 20:19:22


First, open the Google Translate website: https://translate.google.cn/

Next, translate a few words and watch how the URL changes.

Translation mode | Text | Resulting URL
Auto-detect -> Chinese | hello | https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello
Auto-detect -> English | 你好 | https://translate.google.cn/#view=home&op=translate&sl=auto&tl=en&text=%E4%BD%A0%E5%A5%BD
English -> Chinese | hello | https://translate.google.cn/#view=home&op=translate&sl=en&tl=zh-CN&text=hello
Chinese -> English | 你好 | https://translate.google.cn/#view=home&op=translate&sl=zh-CN&tl=en&text=%E4%BD%A0%E5%A5%BD

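Looking at these URLs, sl is followed by the source language, tl by the target language, and text by the text to translate. %E4%BD%A0%E5%A5%BD is the percent-encoded UTF-8 form of "你好". So try replacing that string with "你好" itself and requesting the page again.

Using the Chinese characters directly also returns the correct result.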
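
The URL pattern and the percent-encoding described above can be reproduced with the standard library. The following is a minimal sketch: the helper name build_translate_url is hypothetical and exists only for illustration, and the URL layout is simply copied from the examples in the table.

```python
from urllib.parse import quote, unquote

BASE = "https://translate.google.cn/"


def build_translate_url(text, sl="auto", tl="zh-CN"):
    """Build a page URL in the same shape as the examples in the table."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
    return f"{BASE}#view=home&op=translate&sl={sl}&tl={tl}&text={quote(text)}"


encoded = quote("你好")
print(encoded)           # %E4%BD%A0%E5%A5%BD
print(unquote(encoded))  # 你好
print(build_translate_url("你好", tl="en"))
```

The last line prints the same URL as the "Auto-detect -> English" row, which confirms that the long %E4... string is nothing more than "你好" in encoded form.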


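With that in mind, try using a Python crawler to fetch the page and extract the translated text according to this pattern.

First, translate hello with auto-detect -> Chinese; the URL is https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello . Open that URL, open the developer tools, press Ctrl+Shift+C, and click the "你好" text on the page. Then, in the developer tools, right-click the highlighted (blue) element and choose Copy -> Copy Selector.

Then start writing code: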

from requests_html import HTMLSession

session = HTMLSession()

link = 'https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello'

r = session.get(link)

# CSS selector copied from the developer tools (Copy -> Copy Selector)
f = r.html.find('body > div.container > div.frame > div.page.tlid-homepage.homepage.translate-text > div.homepage-content-wrap > div.tlid-source-target.main-header > div.source-target-row > div.tlid-results-container.results-container > div.tlid-result.result-dict-wrapper > div.result.tlid-copy-target > div.text-wrap.tlid-copy-target > div > span.tlid-translation.translation > span', first=True)

print(f)

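The code runs without errors, but the return value f is None. This suggests that the translated content is loaded asynchronously. Open the Network panel in the developer tools, translate "hello" again, and watch the requests: the result is indeed loaded asynchronously.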
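
One detail that fits this result: in the page URL, everything after # is a fragment, and HTTP clients do not send the fragment to the server. The sketch below only splits the URL with the standard library (no request is made) to show that the language and text parameters all sit in the fragment, so the HTML returned for this address cannot depend on them.

```python
from urllib.parse import urlsplit, parse_qs

link = "https://translate.google.cn/#view=home&op=translate&sl=auto&tl=zh-CN&text=hello"
parts = urlsplit(link)

# Path and query are what the server receives; here the query is empty.
print(repr(parts.path))   # '/'
print(repr(parts.query))  # ''

# The fragment stays in the browser and is read by the page's JavaScript.
print(parts.fragment)

# The fragment happens to use key=value syntax, so parse_qs can split it.
params = parse_qs(parts.fragment)
print(params["sl"], params["tl"], params["text"])
```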

Click through the preview of each request in turn; the one named single?client…… contains the translation result.
So right-click that request and copy its link.
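
Before reusing the copied link, it can help to see which query parameters it carries. This sketch only parses the link with the standard library. The meaning of most parameters is not documented here; the comments on sl, tl and q follow the earlier observations, and the note on tk is an assumption, not something the source confirms.

```python
from urllib.parse import urlsplit, parse_qs

# The link copied from the Network panel, split across lines for readability.
link = ('https://translate.google.cn/translate_a/single?client=webapp&sl=auto&tl=zh-CN&hl=zh-CN'
        '&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=sos&dt=ss&dt=t'
        '&otf=1&ssel=0&tsel=0&xid=45626150&kc=13&tk=680344.843218&q=hello')

params = parse_qs(urlsplit(link).query)

# sl / tl: source and target language, as in the page URL; q: text to translate.
# dt appears several times, so parse_qs collects its values into one list.
# tk looks like a token generated by the page (assumption: it may have to
# match q, so swapping q for other text might not work without a new tk).
for name, values in params.items():
    print(name, values)
```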

Now write the code again:

from requests_html import HTMLSession

session = HTMLSession()

link = 'https://translate.google.cn/translate_a/single?client=webapp&sl=auto&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=sos&dt=ss&dt=t&otf=1&ssel=0&tsel=0&xid=45626150&kc=13&tk=680344.843218&q=hello'

r = session.get(link)

print(r.text)

The output now includes the translation result:

[[["你好","hello",null,null,1]     
,[null,null,"Nǐ hǎo","heˈlō,həˈlō"]
]
,[["感叹词",["你好!","喂!"]
,[["你好!",["Hello!","Hi!","Hallo!"]
,null,0.13117145]
,["喂!",["Hey!","Hello!"]
,null,0.020115795]
]
,"Hello!",9]
]
,"en",null,null,[["hello",null,[["你好",1000,true,false]
,["您好",1000,true,false]
]
,[[0,5]
]
,"hello",0,0]
]
,1.0,[]
,[["en"]
,null,[1.0]
,["en"]
]
,null,null,[["名词",[[["hullo","hi","how-do-you-do","howdy"]
,""]
]
,"hello"]
,["惊叹词",[[["hi","howdy","hey","hiya","ciao","aloha"]
,"m_en_us1254307.001"]
]
,"hello"]
]
,[["名词",[["an utterance of “hello”; a greeting.","m_en_us1254307.006","Colin Spencer still stood by the desk no one signed in at; and he still smiled and nodded his hellos and goodbyes to every oblivious face that passed him by as though he was host to this year's biggest A-list birthday bash."]
]
,"hello"]
,["惊叹词",[["used as a greeting or to begin a telephone conversation.","m_en_us1254307.001","But instead of a normal greeting like saying hello or something, they hugged."]
]
,"hello"]
,["动词",[["say or shout “hello”; greet someone.","m_en_us1254307.007","After all the helloing and such, he would sit down and talk to me in a gruff, military kind of way."]
]
,"hello"]
]
……

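So try extracting the translated text:

content = r.text  # response body from the request above

# The body starts with [[["<translation>"," so the translated text begins at
# index 4 and ends just before the next double quote.
count = content.index('"', 4)

print(content[4:count])

The translated text is extracted successfully.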
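
Slicing by character position depends on the exact layout of the response. Since the printed output looks like valid JSON, an alternative is to parse it and index into the result. This is a sketch under that assumption: the sample string is hand-trimmed from the output shown earlier, and in the real script it would be r.text. Treating every entry of data[0] with a non-null first item as a piece of the translation is also an assumption based on this single sample.

```python
import json

# Hand-trimmed sample in the same shape as the start of the printed output.
content = '[[["你好","hello",null,null,1],[null,null,"Nǐ hǎo","heˈlō,həˈlō"]]]'

data = json.loads(content)

# data[0] is a list of entries; in this sample only the first entry has
# translated text in position 0, the second holds the romanization.
segments = data[0]
translation = "".join(seg[0] for seg in segments if seg[0] is not None)

print(translation)  # 你好
```

Parsing also copes with translations that contain an escaped double quote, which would cut the sliced result short.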

The complete code is as follows:

from requests_html import HTMLSession

session = HTMLSession()

# Request URL copied from the Network panel of the developer tools
link = 'https://translate.google.cn/translate_a/single?client=webapp&sl=auto&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=sos&dt=ss&dt=t&otf=1&ssel=0&tsel=0&xid=45626150&kc=13&tk=680344.843218&q=hello'
r = session.get(link)

content = r.text

# The body starts with [[["<translation>"," so the translated text begins at
# index 4 and ends just before the next double quote.
count = content.index('"', 4)

print(content[4:count])