Scraping the text inside HTML tags with bs4

程序员文章站 (Programmer Article Station), 2022-06-15 13:41:37


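This is really a follow-up to yesterday's task (https://blog.csdn.net/Emmett_Bioinfo/article/details/114590394). After reading up on it, I realized that using Selenium for that job was overkill: extracting the text only requires the page's HTML source, so there is no need to render the whole page in a browser. For this task, yesterday's approach was slow and took a lot of effort for little gain.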

Today I learned a bit of Beautiful Soup 4 and used it to do the same job. The code is below:

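#!/usr/bin/env python3
# coding=utf-8

import requests
from bs4 import BeautifulSoup as bs
from requests.adapters import HTTPAdapter

def get_result(formula):
    # Mount adapters so that failed connections are retried up to 5 times.
    # (Assigning requests.adapters.DEFAULT_RETRIES does not reliably do this.)
    s = requests.Session()
    s.mount('http://', HTTPAdapter(max_retries=5))
    s.mount('https://', HTTPAdapter(max_retries=5))
    # Ask the server to close the connection after each response;
    # s.keep_alive = False is not a requests setting and has no effect.
    s.headers['Connection'] = 'close'
    # Pass the query via params so special characters are URL-encoded.
    r = s.get("http://www.chemspider.com/Search.aspx", params={'q': formula}, timeout=30)
    soup = bs(r.content, 'html.parser')
    h3_tag = soup.find('h3')
    # Return an empty string if the page has no <h3>, so writing the result cannot fail.
    result = h3_tag.get_text() if h3_tag is not None else ''

    return result

def process_file(file):
    # Read one formula per line, skipping blank lines; the with block closes the file.
    with open(file, encoding='utf-8') as f:
        formula_list = [line.strip() for line in f if line.strip()]
    length = len(formula_list)
    return formula_list, length

def main():
    formula_list, length = process_file('E:/Denglab/代码整理/查看网页中元素/lists all.txt')
    for count, formula in enumerate(formula_list, start=1):
        result = get_result(formula)
        # Append after every query so finished results survive an interrupted run.
        with open('E:/Denglab/代码整理/查看网页中元素/output_bs.txt', 'a', encoding='utf-8') as f:
            f.write(formula + '\t' + result + '\n')
        print(f'Output: {count}/{length}')


if __name__ == "__main__":
    main()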
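
The tag-text step can be tried offline. The sketch below is only an illustration, not part of the original script: the three HTML fragments and the helper name first_h3_text are made up for this example, and they do not reflect what ChemSpider actually returns. It shows why reading a tag's .string can be fragile: .string is None when the tag contains nested markup, whereas get_text() joins all the text inside the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragments standing in for a search-result page.
SIMPLE = "<html><body><h3>Found 1 result</h3></body></html>"
NESTED = "<html><body><h3>Found <b>2</b> results</h3></body></html>"
MISSING = "<html><body><p>No heading here</p></body></html>"


def first_h3_text(html):
    """Return the text of the first <h3>, or '' if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h3")
    if tag is None:
        return ""
    # get_text() joins every string inside the tag, including nested tags.
    return tag.get_text()


simple_tag = BeautifulSoup(SIMPLE, "html.parser").find("h3")
nested_tag = BeautifulSoup(NESTED, "html.parser").find("h3")

print(simple_tag.string)              # Found 1 result
print(nested_tag.string)              # None (the <h3> has several children)
print(first_h3_text(NESTED))          # Found 2 results
print(repr(first_h3_text(MISSING)))   # ''
```

If the target page always has a plain-text h3, .string and get_text() give the same result; the difference only matters when the heading contains other tags or is absent.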

