Scraping MalaCards category URLs with Python
A little progress every day. Lately I have set my sights on the GeneCards site (MalaCards, the database scraped here, is the human disease database in the GeneCards suite) and will be wrestling with its data for a few days. To get at the gene information, step one is to collect the URLs of all its records, and that means grabbing its category index first. Let's open the page and walk through the steps.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# _author_='[email protected]'
# Scrape the category index of the MalaCards database.
from bs4 import BeautifulSoup
import requests # help: https://blog.csdn.net/jojoy_tester/article/details/70545589
import csv
import re
First, import the required packages. bs4 does the parsing; for more fault-tolerant HTML parsing, the lxml package also needs to be installed. requests opens the connection and fetches the page text, csv handles storage, and re supplies regular expressions to assist the parsing.
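A quick sketch of why lxml matters as the parser backend: it tolerates malformed markup that would trip a stricter parser. The fragment below is made up purely for illustration:

from bs4 import BeautifulSoup

# Deliberately unclosed tags; the lxml parser repairs the tree instead of failing.
broken = '<table><tr><td>Bone diseases<td>426'
soup = BeautifulSoup(broken, 'lxml')
print([td.get_text() for td in soup.find_all('td')])  # -> ['Bone diseases', '426']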
def check_link(url):
    """Request the page and return its text, or None on failure."""
    try:
        # Mimic a real browser; plain requests without a User-Agent were rejected.
        header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}
        r = requests.get(url, headers=header)
        r.raise_for_status()              # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the real encoding from the body
        return r.text
    except requests.RequestException:
        print('Failed to connect to the server')
Connecting to the page is the critical step, so it is worth checking that the request actually succeeded. Plain requests kept failing, so a headers dict was added to impersonate a browser.
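A minimal usage sketch of the function above (the URL is the real category page this script targets; the length check is just illustrative):

html = check_link('http://www.malacards.org/categories')
if html:
    print(len(html), 'characters fetched')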
def get_contents(ulist, rurl):
    """Walk every table row and collect each category's cells and URL."""
    soup = BeautifulSoup(rurl, 'lxml')  # help: https://cuiqingcai.com/1319.html
    trs = soup.find_all('tr')
    for tr in trs:
        ui = []
        for td in tr:
            if 'href' in str(td):
                # The td tag exposed no usable attribute, so fall back to a regex.
                pattern = re.compile(r'/categories/\w+')
                p = pattern.search(str(td))
                if p:
                    pu = 'http://www.malacards.org' + p.group(0)
                    ui.append(pu)
            else:
                ui.append('\n')
                ui.append(td.string)
        ulist.append(ui)
bs4 is used to find the td cells inside each tr; reference posts for the syntax are linked in the comments. The author fiddled with this for a long time, but the td tag never exposed a usable attribute, so a regular expression was brought in to pull out the URL instead.
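To see what the regex actually matches, here is a minimal sketch run against a made-up td fragment (the markup and category name are hypothetical; the pattern is the one used above):

import re

td = '<td><a href="/categories/Bone_diseases">Bone diseases</a></td>'
m = re.compile(r'/categories/\w+').search(td)
if m:
    print('http://www.malacards.org' + m.group(0))  # -> http://www.malacards.org/categories/Bone_diseases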
def save_contents(urlist):
    """Dump the collected rows to a CSV file."""
    with open("E:/script/categories_of_malacards.csv", 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['disease lists'])
        for i in range(len(urlist)):
            # Pick the three cells of interest out of each collected row.
            writer.writerow([urlist[i][3], urlist[i][7], urlist[i][2]])
def main():
    urli = []
    url = "http://www.malacards.org/categories"
    rs = check_link(url)
    if rs:  # only parse and save when the request succeeded
        get_contents(urli, rs)
        save_contents(urli)

if __name__ == '__main__':
    main()
Finally, csv writes everything out, and the whole category index is captured!
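As a quick check that the file was written, you can read it straight back (a minimal sketch; the path is the one hardcoded above):

import csv

with open("E:/script/categories_of_malacards.csv", newline='') as f:
    for row in csv.reader(f):
        print(row)  # header first, then one three-column row per category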