Scraping the text inside HTML tags with bs4
程序员文章站
2022-06-15 13:41:37
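This is really a follow-up to yesterday's work (https://blog.csdn.net/Emmett_Bioinfo/article/details/114590394). After reading up a bit, I realized that using Selenium for yesterday's task was overkill: getting the text content only requires the page source, and there is no need to render the whole page. For this job, yesterday's approach was slow and a lot of effort for little return.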
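To make that concrete, here is a minimal sketch of text extraction working on nothing but a source string. The HTML snippet and the `first_h3_text` helper are made up for illustration and are not ChemSpider's actual markup; the point is only that Beautiful Soup parses a plain string, with no browser involved.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; a real one would come from requests.get(...).content.
PAGE_SOURCE = """
<html>
  <body>
    <h1>Search</h1>
    <h3>Found 1 result</h3>
    <p>Some other content</p>
  </body>
</html>
"""


def first_h3_text(html):
    """Return the text of the first <h3> in an HTML string, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h3 = soup.find("h3")
    return h3.get_text(strip=True) if h3 is not None else None


print(first_h3_text(PAGE_SOURCE))  # Found 1 result
```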
Today I learned a bit of Beautiful Soup 4 and used it to do the same job. The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# The Selenium imports from yesterday's version are no longer needed.
import requests
from bs4 import BeautifulSoup as bs


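def get_result(formula):
    # Return the text of the first <h3> on the ChemSpider search page for formula.
    with requests.Session() as s:
        # Retry failed connections up to 5 times. Assigning
        # requests.adapters.DEFAULT_RETRIES after import does not affect new sessions.
        adapter = requests.adapters.HTTPAdapter(max_retries=5)
        s.mount("http://", adapter)
        s.mount("https://", adapter)
        # Ask the server not to keep the connection alive.
        s.headers["Connection"] = "close"
        # Let requests URL-encode the query instead of concatenating strings.
        r = s.get("http://www.chemspider.com/Search.aspx", params={"q": formula})
        soup = bs(r.content, 'html.parser')
        h3_tag = soup.select('h3')
        # get_text() also works when the <h3> contains nested tags (.string would be None).
        result = h3_tag[0].get_text() if h3_tag else ''
        return result

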
def process_file(file):
    # Read one formula per line; the with block closes the file afterwards.
    with open(file) as f:
        formula_list = [line.rstrip('\n') for line in f]
    length = len(formula_list)
    #print(formula_list, length)
    return formula_list, length


def main():
    formula_list, length = process_file('E:/Denglab/代码整理/查看网页中元素/lists all.txt')
    count = 0
    for formula in formula_list:
        result = get_result(formula)
        count = count + 1
        with open('E:/Denglab/代码整理/查看网页中元素/output_bs.txt', 'a+') as f:
            f.write(formula)
            f.write('\t')
            f.write(result)
            f.write('\n')
        print('Output: ' + str(count) + '/' + str(length))


if __name__ == "__main__":
    main()
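
A note on reading the heading text: in Beautiful Soup, `.string` is `None` whenever a tag has more than one child, while `get_text()` joins all the nested text. The sketch below shows the difference on two made-up `h3` snippets (not taken from ChemSpider), so treat it as an illustration rather than a description of what the site returns.

```python
from bs4 import BeautifulSoup

# Two made-up headings: one with plain text, one with a nested tag.
plain = BeautifulSoup("<h3>Found 1 result</h3>", "html.parser").h3
nested = BeautifulSoup("<h3>Found <b>3</b> results</h3>", "html.parser").h3

print(plain.string)       # Found 1 result
print(nested.string)      # None, because this <h3> has three children
print(nested.get_text())  # Found 3 results
```

Passing `None` to `f.write()` raises a `TypeError`, so `get_text()` is the safer choice when the heading might contain markup.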