[Crawler] Python Web Scraping | urllib | BeautifulSoup
2022-05-03 20:06:21
Documentation and source code
Scraper source code: https://github.com/REMitchell/python-scraping
urllib documentation: https://docs.python.org/3/library/urllib.html
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Required files: http://pan.baidu.com/s/1i55olGL password: 1985
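Installing BeautifulSoup
BeautifulSoup helps you parse the documents you fetch, in HTML or XML format.
- Download a release: https://www.crummy.com/software/BeautifulSoup/bs4/download/
- Extract it into Python's lib directory
- In cmd, change into the beautifulsoup folder and run:
python setup.py install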
Error: You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work:
- Extract the bs4 folder into python/lib
- Copy python/Tools/scripts/2to3.py into the lib directory as well
- In cmd, change to the python/lib folder and run:
2to3.py bs4 -w
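Notes:
2to3.py param1 (-w)
param1 can be a .py file or a folder to convert (every .py file inside the folder is converted as well)
-w is optional. Without it, the proposed changes are only printed to the screen; add -w to write the converted code back to the original files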
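Parsing an XML document with BeautifulSoup
Reading the document
from bs4 import BeautifulSoup
# Open the file at kmlFilePath ("r" means read-only)
openKmlFile = open(kmlFilePath, "r")
# Read the text in the file
kmlDom = openKmlFile.read()
# Parse the string and return a BeautifulSoup object
bsObj = BeautifulSoup(kmlDom, "html.parser")
# Close the file
openKmlFile.close()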
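The same read-and-parse steps can also be written with a `with` block, so the file is closed even if reading fails. This is a minimal sketch, assuming a trimmed-down KML fragment written to a temporary path; the fragment, the file name `sample.kml`, and the helper name `parse_kml` are made up for illustration. One thing to watch for: `html.parser` lowercases tag names, so a KML tag such as `<Placemark>` has to be looked up as `placemark`.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical KML fragment, invented for this example.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Sample point</name>
  </Placemark>
</kml>
"""


def parse_kml(kml_file_path):
    """Read a KML file and return it as a BeautifulSoup object."""
    # The with block closes the file even if read() raises.
    with open(kml_file_path, "r", encoding="utf-8") as kml_file:
        kml_text = kml_file.read()
    return BeautifulSoup(kml_text, "html.parser")


# Write the sample to a temporary file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.kml")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write(SAMPLE_KML)

bs_obj = parse_kml(sample_path)
# html.parser lowercases tag names, so search for "placemark".
placemark = bs_obj.find("placemark")
# Use find("name") here: placemark.name is the tag's own name, not a child.
placemark_name = placemark.find("name").get_text()
print(placemark_name)
```

Passing the path into a small function keeps the parsing step reusable for other files.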
Finding tags
bsObj.findAll("body")  # returns all <body> tags in bsObj
bsObj.find("body")  # returns the first <body> tag in bsObj
Getting the data inside a tag
bsObj.find("body").get_text()
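To make the lookup calls concrete, here is a minimal sketch on an in-memory HTML string (the markup is invented for illustration). Tag names are passed without angle brackets, `find_all` is the current spelling of `findAll`, and `find` returns `None` when nothing matches, so it is worth checking the result before calling `get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = "<html><body><h1>Hello</h1><p>first</p><p>second</p></body></html>"
bs_obj = BeautifulSoup(html, "html.parser")

# find_all returns every match as a list-like ResultSet.
paragraphs = [p.get_text() for p in bs_obj.find_all("p")]

# find returns only the first match.
first_paragraph = bs_obj.find("p").get_text()

# find returns None when there is no such tag, so guard before get_text().
missing = bs_obj.find("table")
missing_text = missing.get_text() if missing is not None else None

print(paragraphs)
print(first_paragraph)
print(missing_text)
```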
A simple, safe crawler
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The page could not be fetched
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # There is no <body>, so .h1 cannot be looked up
        return None
    return title

# "url" is a placeholder; pass a full address including the scheme
title = getTitle("url")
if title is None:
    print("Title could not be found")
else:
    print(title)
Result
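Which branch runs depends on how the request fails. The offline sketch below uses only the standard library. It shows that `urlopen` raises `URLError` for a `file://` URL pointing at a missing file, but raises `ValueError` for a string with no scheme, such as the placeholder `"url"`, which an `except (HTTPError, URLError)` clause does not catch. The helper name `classify_fetch` and the file path are made up for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify_fetch(url):
    """Try to open url and report which outcome occurred (hypothetical helper)."""
    try:
        response = urlopen(url)
    except HTTPError:
        # The server answered with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return "HTTPError"
    except URLError:
        # The resource could not be reached at all.
        return "URLError"
    except ValueError:
        # The string is not a usable URL, for example it has no scheme.
        return "ValueError"
    response.close()
    return "ok"


# A file:// URL that points at a path assumed not to exist.
missing_file_result = classify_fetch("file:///no/such/dir/no-such-file.html")
# A bare placeholder string without a scheme.
no_scheme_result = classify_fetch("url")
print(missing_file_result)
print(no_scheme_result)
```

If placeholder or user-supplied strings may reach the crawler, catching `ValueError` alongside the two urllib exceptions is one way to keep it from crashing.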