Web scraping and data processing (replacing br/ with newlines)

程序员文章站 2024-03-11 11:50:31


The goal is to scrape the ranking information from the page https://zhidao.baidu.com/question/1302141487765288859.html:

code:

#coding=utf-8
from urllib.request import urlopen    # import urlopen and BeautifulSoup
from bs4 import BeautifulSoup

html = urlopen("https://zhidao.baidu.com/question/1302141487765288859.html").read()    # open the URL and read its content into html
soup = BeautifulSoup(html, features='lxml')    # pass html to BeautifulSoup and parse it with "lxml"
all_pre = soup.select("pre")    # select every 'pre' tag
#print(all_pre)
for pre in all_pre:
    print(pre.get_text())    # print the text with the markup stripped


To see the raw markup behind that output, print the list of matched tags directly:

print(all_pre)


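A pattern is visible in the raw output: the entries are separated by '<br/>', so replacing each '<br/>' with '\n' is all that is needed.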
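The replacement can be tried on a small made-up fragment before running it against the real page. The sketch below is illustrative only: the `sample` string and the `br_to_newline` helper are hypothetical names, and the regular expression also accepts the `<br>` and `<br />` spellings, which go beyond the single `'<br/>'` form the article deals with.

```python
import re

# Hypothetical sample standing in for one <pre> block from the page.
sample = "<pre>first line<br/>second line<br />third line<BR>fourth line</pre>"

# Matches <br>, <br/> and <br /> in any letter case.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_newline(markup: str) -> str:
    """Replace every <br> variant in markup with a newline character."""
    return BR_PATTERN.sub("\n", markup)


converted = br_to_newline(sample)
print(converted)
```

Only the line breaks are converted here; the surrounding `<pre>` tags are left in place and still have to be stripped in a separate step.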

code:

#coding=utf-8
from urllib.request import urlopen    # import urlopen and BeautifulSoup
from bs4 import BeautifulSoup

html = urlopen("https://zhidao.baidu.com/question/1302141487765288859.html").read()    # open the URL and read its content into html
soup = BeautifulSoup(html, features='lxml')    # pass html to BeautifulSoup and parse it with "lxml"
all_pre = soup.select("pre")    # select every 'pre' tag
#print(all_pre)
#for pre in all_pre:
#    print(pre.get_text())    # print the text with the markup stripped

s = str(all_pre)    # convert the list of tags to a string
s_replace = s.replace('<br/>', "\n")    # replace every '<br/>' with a newline
while True:    # strip the remaining tags one at a time
    index_begin = s_replace.find("<")
    if index_begin == -1:    # no tags left
        break
    index_end = s_replace.find(">", index_begin + 1)
    if index_end == -1:    # a '<' with no closing '>': stop rather than loop forever
        break
    s_replace = s_replace.replace(s_replace[index_begin:index_end+1], "")
#print(type(s_replace))
print(s_replace)
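As an alternative to searching for '<' and '>' by hand, the tag handling can be left to the standard library's `html.parser`. This is a minimal sketch, not the article's method: the `PreTextExtractor` class, the `markup_to_text` helper and the `sample` string are all hypothetical, and the sketch assumes the markup is already available as a string (for example from `str(tag)`).

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect text, turning each <br> into a newline and dropping other tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; inside text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Called for <br> and, via the default handle_startendtag, for <br/>.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Plain text between tags is kept unchanged.
        self.parts.append(data)


def markup_to_text(markup: str) -> str:
    """Return the text of markup with <br> as newlines and all tags removed."""
    parser = PreTextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical sample standing in for one <pre> block from the page.
sample = "<pre>1. Alpha &amp; Co<br/>2. Beta<br>3. Gamma</pre>"
text = markup_to_text(sample)
print(text)
```

Unlike plain string replacement, this approach also decodes character entities, so `&amp;` comes out as `&` in the result.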

The output shows that the data is now in the form we want.