Python web scraping for beginners: why print(bs.h1) returns None after parsing with bs4, and how to fix it

程序员文章站 2024-03-13 23:04:23


I use Python 3.7. The code was tested successfully both with the Anaconda distribution (Python 3.7) and with a separately installed bs4 4.9.1.

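I had just started learning web scraping, and my very first BeautifulSoup example failed: print(bs.h1) returned None, even though the page clearly contains an h1 tag.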

Take the following code as an example.

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
print(html.read())

If the page loads correctly, the output looks like this:

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum……

But if we simply append the bs4 parsing code, things go wrong. For example:

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
print(html.read())
# The lines below are newly added
bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)

The output is:

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum……

None

That can't be right! Why None?

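Why does print(bs.h1) return None? Because we tried to read the html object more than once. The object returned by urlopen is a file-like stream: the first html.read() consumes it to the end, so every later read returns an empty byte string. That includes the read BeautifulSoup performs internally when it is handed html directly. This is an easy trap for beginners. If you don't believe it, run the following code: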

from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
print(html.read())
print(html.read())

The second print should output only b''.
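The same behavior can be reproduced without touching the network. The following is only a minimal sketch: io.BytesIO is used as a hypothetical stand-in for the object urlopen returns (the name fake_response is mine, not part of any library), on the assumption that both behave like file-like byte streams.

```python
import io

# Assumption: io.BytesIO stands in for the response object from urlopen().
# Both are file-like byte streams, so read() drains them in the same way.
fake_response = io.BytesIO(
    b"<html><body><h1>An Interesting Title</h1></body></html>"
)

first = fake_response.read()   # consumes the stream up to the end
second = fake_response.read()  # nothing is left, so this is empty

print(first)
print(second)  # b''
```

Seeing the empty second result here makes it easier to believe that the network response is not broken; it has simply been read to the end already.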

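So there are two ways to fix this. The first is to read the data strictly once and keep the result in a separate variable. The second is to use the requests package (which has to be installed separately with pip).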

Let's look at the first one. The following code should let bs4 work normally.

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
html2 = html.read()
print(html2)
print(html2)
bs = BeautifulSoup(html2, 'html.parser')
print(bs.h1)
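To compare the failing and the working pattern side by side, here is a small offline sketch. It assumes bs4 is installed, and it again uses io.BytesIO as a stand-in for the urlopen response; the helper names h1_after_extra_read and h1_read_once are hypothetical and exist only for this illustration.

```python
import io
from bs4 import BeautifulSoup

# Stand-in for the page body; in the article it comes from urlopen(...).
PAGE = b"<html><body><h1>An Interesting Title</h1></body></html>"

def h1_after_extra_read(stream):
    """Failing pattern: the stream is drained before bs4 sees it."""
    stream.read()  # plays the role of print(html.read())
    return BeautifulSoup(stream, "html.parser").h1

def h1_read_once(stream):
    """Working pattern: read once, keep the bytes, parse the bytes."""
    raw = stream.read()
    return BeautifulSoup(raw, "html.parser").h1

broken = h1_after_extra_read(io.BytesIO(PAGE))
fixed = h1_read_once(io.BytesIO(PAGE))

print(broken)  # None
print(fixed)   # <h1>An Interesting Title</h1>
```

Once the bytes are stored in a variable, they can be printed or parsed as many times as needed, which is why the double print(html2) above causes no harm.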

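An even better approach is to use the requests library.

Once requests is installed, this pitfall does not come up, because rsp.content holds the whole response body as bytes and can be accessed repeatedly. It is easy to use (treat get like urlopen, and .content like read()):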

from bs4 import BeautifulSoup
import requests
rsp = requests.get('http://www.pythonscraping.com/pages/page1.html')
print(rsp.content)
bs = BeautifulSoup(rsp.content, 'html.parser')
print(bs.h1)

This way you will see the correct parse result:

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum……

<h1>An Interesting Title</h1>

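Notes:

If Python was installed through Anaconda, the requests library and bs4 are both included already;

The URL passed to urlopen can be changed; I used the sample URL from 《python网络爬虫权威指南》.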

References:

《python网络爬虫权威指南（第2版）》

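Finally, Baidu Jingyan told me this write-up was too simple and would not be approved without pictures? Sorry, I couldn't be bothered to take screenshots of the code, so I posted it on CSDN instead. You can even copy it from here, which is much nicer.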