Scraping 91dict (人人词典) with BeautifulSoup to Fill an ANKI Word Deck
Since I started preparing for the postgraduate entrance exam, I've been hopelessly hooked on ANKI. I downloaded a deck of vocabulary from past exam papers, but grinding through bare word lists soon got dull. That led me to contextual memorization, which naturally made me think of American TV shows: how great would it be if every word came with a matching line from a show, plus the audio and a translation!
That's exactly what 人人词典 (91dict) does. So, no time to lose: I exported the word list from the deck and set out to scrape the corresponding content from the site. I'd worried 91dict might be hard to deal with, but it turned out the basic scraping libraries were all it took…
Source code:
import pandas as pd
from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup as bs
import re
import urllib.request
import time
# Load the exported ANKI word list; only the 'words' column is filled at this point.
# (Named col_names rather than headers, to avoid clashing with the HTTP headers below.)
col_names = ['words', 'english', 'chinese', 'source']
file = pd.read_csv('./anki.csv', names=col_names)
file.head()
        words english chinese source
0     contend     NaN     NaN    NaN
1  perceptive     NaN     NaN    NaN
2    lameness     NaN     NaN    NaN
3    mobilize     NaN     NaN    NaN
4       plead     NaN     NaN    NaN
# Drop duplicate words, then pull out the word column
file.drop_duplicates('words', inplace=True)
wordslist = file['words']
len(wordslist)
771
# Seed the results DataFrame with the first word's (still empty) row
new_words = file[file['words'] == 'contend']
words_91dict = pd.DataFrame(new_words)
words_91dict
     words english chinese source
0  contend     NaN     NaN    NaN
url = 'http://www.91dict.com/words?'

def get_page(keyword):
    # Build the lookup URL for a given word
    url_words = urlencode({'w': keyword})
    target_url = url + url_words
    return target_url
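For instance, looking up "contend" just appends the urlencoded query (assuming the site's query format is unchanged):

get_page('contend')  # -> 'http://www.91dict.com/words?w=contend'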
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
def get_content(target_url):
    # Fetch the page and parse it with lxml
    response = requests.get(target_url, headers=headers).content
    content = bs(response, 'lxml')
    return content
def get_target(content):
    # The first <li> in the #flexslider_2 carousel holds the top example sentence
    target = content.select("#flexslider_2 > ul.slides.clearfix > li:nth-of-type(1)")
    words = {}
    # Note the \s in the patterns: the HTML puts a newline right after
    # each tag, so the regexes have to allow for it (see the pitfall below)
    image_pattern = re.compile(r"<img src=\"(.*?)\"/>").findall(str(target))
    audio_pattern = re.compile(r"<audio src=\"(.*?)\">").findall(str(target))
    source_pattern = re.compile(r'''</audio>\s(.*)\s</div>''').findall(str(target))
    english_pattern = re.compile(r'''<div class="mBottom">\s(.*?)</div>''').findall(str(target))
    chinese_pattern = re.compile(r'''<div class="mFoot">\s(.*?)</div>''').findall(str(target))
    words['image'] = image_pattern[0]
    words['audio'] = audio_pattern[0]
    words['source'] = source_pattern[0]
    words['english'] = english_pattern[0]
    words['chinese'] = chinese_pattern[0]
    return words
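A quick sanity check, chaining the three helpers on a single word (assuming the page structure still matches the selectors above):

info = get_target(get_content(get_page('contend')))
print(info['english'], info['chinese'], info['source'])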
There's one huge pitfall here: when extracting the data, at first neither my CSS selectors nor my regular expressions would match anything. After poking at it for ages, I realized the HTML source has a newline right after each piece of content I wanted, which is why every pattern above needs a \s.
So when nothing matches no matter what you try, keep an eye out for tiny hidden traps like this!
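A minimal reproduction of the trap, on a made-up snippet of HTML:

import re

html = '<div class="mBottom">\nHe will contend for the title.</div>'

# '.' does not match the newline, so the naive pattern finds nothing
print(re.findall(r'<div class="mBottom">(.*?)</div>', html))    # []

# allowing one whitespace character (the newline) after the tag fixes it
print(re.findall(r'<div class="mBottom">\s(.*?)</div>', html))  # ['He will contend for the title.']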
for i in wordslist:
    try:
        target_url = get_page(i)
        content = get_content(target_url)
        words = get_target(content)
        # Save the audio clip and screenshot locally, named after the word
        urllib.request.urlretrieve(words['audio'], 'E:/91dict/' + i + '.mp3')
        urllib.request.urlretrieve(words['image'], 'E:/91dict/' + i + '.jpg')
        # Copy the row to avoid pandas' SettingWithCopyWarning
        new_words = file[file['words'] == i].copy()
        new_words['english'] = words['english']
        new_words['chinese'] = words['chinese']
        new_words['source'] = words['source']
        new_words['words'] = i
        print(new_words)
        words_91dict = pd.concat([words_91dict, new_words])
        time.sleep(5)  # be polite to the server
    except Exception as e:
        print('something went wrong with', i, ':', e)
        words  english                                             chinese                 source
0     contend  In Syria, the governor has invading Parthians ...  叙利亚得总督要对付的可是入侵的帕提亚人呢  来自《公元:《圣经故事》后传 第7集》
1  perceptive  Oh, top marks, like I said, you are <em>percep...  一点不错 果然有洞察力            来自《X战警:逆转未来》
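As an aside, pd.concat inside the loop copies the growing frame on every iteration. A common alternative, sketched here rather than what this post actually ran, is to collect plain dicts and build the DataFrame once at the end:

rows = []
for i in wordslist:
    try:
        words = get_target(get_content(get_page(i)))
        # (media downloads omitted for brevity)
        rows.append({'words': i, 'english': words['english'],
                     'chinese': words['chinese'], 'source': words['source']})
        time.sleep(5)
    except Exception as e:
        print('skipping', i, ':', e)

words_91dict = pd.DataFrame(rows)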
words_91dict.drop_duplicates(inplace=True)
words_91dict.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 706 entries, 0 to 1350
Data columns (total 4 columns):
words 706 non-null object
english 705 non-null object
chinese 705 non-null object
source 705 non-null object
dtypes: object(4)
memory usage: 16.5+ KB
words_91dict.to_csv('./words91dict.csv')
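From here the CSV still has to be turned into something ANKI can import. One sketch (the field layout is my assumption, not part of the original workflow): ANKI renders HTML in fields and plays [sound:file.mp3] references, provided the media files from E:/91dict/ are copied into the profile's collection.media folder, so the downloaded .mp3/.jpg names can be wired in like this:

df = pd.read_csv('./words91dict.csv', index_col=0)
df['audio'] = df['words'].map(lambda w: '[sound:' + w + '.mp3]')
df['image'] = df['words'].map(lambda w: '<img src="' + w + '.jpg">')
df.to_csv('./anki_import.csv', index=False)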