
Scraping 人人词典 (91dict) with BeautifulSoup for content matching an ANKI word deck


Having recently started preparing for the postgraduate entrance exam, I've become hopelessly addicted to ANKI. I downloaded a deck of vocabulary from past exam papers, but grinding through it soon felt lifeless. That got me thinking about learning words in context, which naturally led to American TV shows! How great would it be if every word came with a matching sentence from a show, plus the audio and a translation!

That's exactly what 人人词典 (91dict) offers! With no time to lose, I exported the word list from the deck and set about scraping the matching content from 人人词典. At first I worried the site would be hard to crack, but it turned out the basic scraping libraries were all it took…

Source code:

import pandas as pd
from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup as bs
import re
import urllib.request
import time

# Column names for the exported ANKI word list
# (named col_names so it isn't shadowed by the request headers later on)
col_names = ['words', 'english', 'chinese', 'source']
file = pd.read_csv('./anki.csv', names=col_names)
file.head()
words english chinese source
0 contend NaN NaN NaN
1 perceptive NaN NaN NaN
2 lameness NaN NaN NaN
3 mobilize NaN NaN NaN
4 plead NaN NaN NaN
# De-duplicate the word list
file.drop_duplicates('words', inplace=True)
wordslist = file['words']
len(wordslist)
771
# Seed the results DataFrame with a single (still empty) row
new_words = file[file['words'] == 'contend']
words_91dict = pd.DataFrame(new_words)
words_91dict
words english chinese source
0 contend NaN NaN NaN
url = 'http://www.91dict.com/words?'

# Build the 91dict query URL for a given word
def get_page(keyword):
    url_words = urlencode({'w': keyword})
    target_url = url + url_words
    return target_url
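
For example (a quick check, not part of the original post), the helper just appends the URL-encoded keyword:

get_page('contend')
'http://www.91dict.com/words?w=contend'
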
# Pretend to be a browser so the site serves the normal page
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}

# Fetch the page and parse it with lxml
def get_content(target_url):
    response = requests.get(target_url, headers=headers).content
    content = bs(response, 'lxml')
    return content
# Take the first example-sentence slide and pull out the fields with regexes
def get_target(content):
    target = content.select("#flexslider_2 > ul.slides.clearfix > li:nth-of-type(1)")
    words = {}
    # Note the \s in several patterns: the HTML puts newlines around the text
    image_pattern = re.compile(r"<img src=\"(.*?)\"/>").findall(str(target))
    audio_pattern = re.compile(r"<audio src=\"(.*?)\">").findall(str(target))
    source_pattern = re.compile(r'''</audio>\s(.*)\s</div>''').findall(str(target))
    english_pattern = re.compile(r'''<div class="mBottom">\s(.*?)</div>''').findall(str(target))
    chinese_pattern = re.compile(r'''<div class="mFoot">\s(.*?)</div>''').findall(str(target))

    words['image'] = image_pattern[0]
    words['audio'] = audio_pattern[0]
    words['source'] = source_pattern[0]
    words['english'] = english_pattern[0]
    words['chinese'] = chinese_pattern[0]
    return words

There is one absolutely enormous trap here: when extracting the data, at first neither CSS selectors nor regular expressions would match anything! After half a day of digging, it turned out that the HTML source has a newline character right next to the content being matched, which is why every pattern above needs the "\s"!

So whenever nothing matches no matter what you try, watch out for these tiny hidden traps!
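
A minimal sketch of the trap (the HTML snippet here is made up to mimic 91dict's markup):

import re

html = '<div class="mBottom">\nOh, top marks.</div>'  # newline right after the tag

re.compile(r'<div class="mBottom">(.*?)</div>').findall(html)
[]                      # '.' does not match '\n', so nothing is found
re.compile(r'<div class="mBottom">\s(.*?)</div>').findall(html)
['Oh, top marks.']      # the \s eats the newline and the match succeeds
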

for i in wordslist:
    try:
        target_url = get_page(i)
        content = get_content(target_url)
        words = get_target(content)
        # Save the audio clip and screenshot locally, named after the word
        urllib.request.urlretrieve(words['audio'], 'E:/91dict/' + i + '.mp3')
        urllib.request.urlretrieve(words['image'], 'E:/91dict/' + i + '.jpg')
        # .copy() avoids pandas' SettingWithCopyWarning when filling the columns in
        new_words = file[file['words'] == i].copy()
        new_words['english'] = words['english']
        new_words['chinese'] = words['chinese']
        new_words['source'] = words['source']
        new_words['words'] = i
        print(new_words)
        words_91dict = pd.concat([words_91dict, new_words])
        time.sleep(5)  # be polite: pause between requests
    except Exception as e:
        print('something went wrong with', i, ':', e)
     words                                            english  \
0  contend  In Syria, the governor has invading Parthians ...   

                chinese               source  
0  叙利亚得总督要对付的可是入侵的帕提亚人呢  来自《公元:《圣经故事》后传 第7集》  
        words                                            english      chinese  \
1  perceptive  Oh, top marks, like I said, you are <em>percep...  一点不错 果然有洞察力   

         source  
1  来自《X战警:逆转未来》  
# Remove any duplicate rows and inspect the result
words_91dict.drop_duplicates(inplace=True)
words_91dict.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 706 entries, 0 to 1350
Data columns (total 4 columns):
words      706 non-null object
english    705 non-null object
chinese    705 non-null object
source     705 non-null object
dtypes: object(4)
memory usage: 16.5+ KB
words_91dict.to_csv('./words91dict.csv')
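
From here you just need to get everything into ANKI. A minimal sketch of one way to finish the job (this step is not in the original code, and the file names and field layout are my assumptions): ANKI plays audio written as [sound:file.mp3] and shows images via ordinary <img> tags, with the media files placed in the profile's collection.media folder.

import pandas as pd

df = pd.read_csv('./words91dict.csv', index_col=0)

# Assumed layout: reference the .mp3/.jpg files saved earlier, which must be
# copied into ANKI's collection.media folder by hand
df['audio'] = '[sound:' + df['words'] + '.mp3]'
df['image'] = '<img src="' + df['words'] + '.jpg">'

# Tab-separated, headerless file that ANKI's File > Import dialog accepts
df.to_csv('./anki_import.txt', sep='\t', header=False, index=False)
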