
Scraping the text inside HTML tags with bs4

程序员文章站 2022-06-15 13:41:37

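This is a follow-up to yesterday's work (https://blog.csdn.net/Emmett_Bioinfo/article/details/114590394). After reading up on the topic, I realized that Selenium was overkill for that task: extracting the text only requires the page source, so there is no need to render the whole page in a browser. Yesterday's approach was slow and took far more effort than the job deserved.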
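To illustrate the point that the page source alone is enough, here is a minimal sketch that pulls the text of the first h3 tag out of an HTML string without any browser. The SAMPLE_HTML string and the first_h3_text helper are made up for illustration; in a real run the string would be the body returned by requests.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, standing in for what requests.get(url).content returns.
SAMPLE_HTML = """
<html><body>
  <h1>Search</h1>
  <h3>Found 1 result</h3>
  <p>Other content</p>
</body></html>
"""


def first_h3_text(html):
    """Return the text of the first <h3> in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h3")
    if tag is None:
        return None
    return tag.get_text().strip()


print(first_h3_text(SAMPLE_HTML))  # Found 1 result
```

Parsing a string like this takes milliseconds, whereas Selenium has to start a browser and render the page first.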

Today I learned a bit of Beautiful Soup 4 and used it to do the same job. The code is as follows:

#!/usr/bin/env python3
# coding=utf-8

import requests
from bs4 import BeautifulSoup as bs

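def get_result(formula):
	s = requests.Session()
	# Retry failed connections up to 5 times by mounting an adapter;
	# assigning requests.adapters.DEFAULT_RETRIES does not change the retry count
	s.mount('http://', requests.adapters.HTTPAdapter(max_retries=5))
	# Session has no keep_alive setting, so ask for the connection to be closed via a header
	s.headers['Connection'] = 'close'
	# Pass the formula through params so that it is URL-encoded
	r = s.get("http://www.chemspider.com/Search.aspx", params={'q': formula}, timeout=30)
	s.close()
	soup = bs(r.content, 'html.parser')
	# Take the text of the first <h3>; return '' if the page has no <h3>
	h3_tag = soup.select_one('h3')
	result = h3_tag.get_text().strip() if h3_tag else ''

	return result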

def process_file(file):
	# Read one formula per line; the with block closes the file afterwards
	with open(file) as f:
		formula_list = [line.rstrip('\n') for line in f]
	length = len(formula_list)
	return formula_list, length

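def main():
	formula_list, length = process_file('E:/Denglab/代码整理/查看网页中元素/lists all.txt')
	# Count from 1 so that count doubles as the progress counter
	for count, formula in enumerate(formula_list, start=1):
		result = get_result(formula)
		# Append one tab-separated line per formula, so finished results
		# are kept even if a later request fails
		with open('E:/Denglab/代码整理/查看网页中元素/output_bs.txt', 'a') as f:
			# Fall back to '' in case no text was found for this formula
			f.write(formula + '\t' + (result or '') + '\n')
		print('Output: ' + str(count) + '/' + str(length))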


if __name__ == "__main__":
	main()
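
One detail of the script worth a closer look is how the text is read from the h3 tag. The sketch below shows that .string returns None when a tag has child tags, while get_text() collects all of the nested text. The two HTML snippets and the tag_text helper are made up for illustration and are not taken from the ChemSpider page.

```python
from bs4 import BeautifulSoup


def tag_text(tag):
    """Return all text inside tag, including the text of nested tags."""
    return tag.get_text().strip()


# Hypothetical snippets: one <h3> with plain text, one with a nested <b>.
plain = BeautifulSoup("<h3>Found 1 result</h3>", "html.parser").h3
nested = BeautifulSoup("<h3>Found <b>1</b> result</h3>", "html.parser").h3

print(plain.string)      # Found 1 result
print(nested.string)     # None, because the tag has more than one child
print(tag_text(nested))  # Found 1 result
```

If the heading on the target page ever contains markup, writing the None returned by .string to a file raises a TypeError, so get_text() is the safer choice here.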