[Crawler] Python Web Crawling | urllib | BeautifulSoup

程序员文章站 2022-05-03 20:06:21

Documentation and Source Code
Installing BeautifulSoup
Parsing an XML Document with BeautifulSoup
A Simple, Safe Crawler

Documentation and Source Code

Open-source crawler code: https://github.com/REMitchell/python-scraping
urllib documentation: https://docs.python.org/3/library/urllib.html
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Required files: http://pan.baidu.com/s/1i55olGL (password: 1985)

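Installing BeautifulSoup

BeautifulSoup parses documents you have fetched, in either HTML or XML format.

Download a release:
https://www.crummy.com/software/BeautifulSoup/bs4/download/
Extract it into Python's lib directory.
In cmd, change into the beautifulsoup folder and run: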
```
python setup.py install
```
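If you get the error "You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work":
1. Extract the bs4 folder into python/lib
2. Copy python/Tools/scripts/2to3.py into the lib directory as well
3. In cmd, change to the python/lib folder and run 2to3.py bs4 -w

Notes:

2to3.py param1 (-w)

param1 can be a single .py file to convert, or a folder (every .py file inside it is converted)
-w is optional. Without it, the converted result is only printed to the screen; with it, the converted code is written back to the original files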
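Once installation has finished, it can be useful to confirm that Python can actually find the package. The following is a minimal sketch, not part of the original steps: `isInstalled` is a hypothetical helper name, and it relies only on the standard library's `importlib.util.find_spec`. On current Python 3 versions the BeautifulSoup documentation recommends `pip install beautifulsoup4`, which avoids the 2to3 conversion entirely.

```python
import importlib.util


def isInstalled(moduleName):
    """Return True if a top-level module named moduleName can be located."""
    # find_spec returns None when no such top-level module exists
    return importlib.util.find_spec(moduleName) is not None


if isInstalled("bs4"):
    import bs4
    print("bs4 version:", bs4.__version__)
else:
    print("bs4 is not installed")
```

If the second message is printed, the install step did not place bs4 on the interpreter's import path.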

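Parsing an XML Document with BeautifulSoup

Reading the document

from bs4 import BeautifulSoup

# kmlFilePath is assumed to hold the path of the KML file to read
# Open the file read-only ('r'); the with block closes it automatically
with open(kmlFilePath, "r") as openKmlFile:
    # Read the text of the file
    kmlDom = openKmlFile.read()
# Parse the string and return a BeautifulSoup object
# ("html.parser" lowercases tag names; the "xml" parser, which requires lxml, preserves them)
bsObj = BeautifulSoup(kmlDom, "html.parser")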

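Finding tags

Pass the tag name without angle brackets:

bsObj.find_all("body")  # returns all <body> tags in bsObj
bsObj.find("body")      # returns the first <body> tag in bsObj, or None if there is none

Getting the data inside a tag

bsObj.find("body").get_text()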
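To show the lookup calls working together, here is a small sketch that runs on an inline string instead of a file. The KML fragment and the helper name `getPlacemarkNames` are made up for illustration. It assumes bs4 is installed and uses the same `"html.parser"` as above, which lowercases tag names, so the search uses `"placemark"` and not `"Placemark"`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up KML fragment used only for illustration
SAMPLE_KML = """
<kml>
  <Document>
    <Placemark><name>Point A</name></Placemark>
    <Placemark><name>Point B</name></Placemark>
  </Document>
</kml>
"""


def getPlacemarkNames(kmlText):
    """Return the text of the first <name> tag inside each <Placemark>."""
    bsObj = BeautifulSoup(kmlText, "html.parser")
    names = []
    # html.parser lowercases tag names, so search with the lowercase form
    for placemark in bsObj.find_all("placemark"):
        nameTag = placemark.find("name")
        # find() returns None when the tag is missing, so check before get_text()
        if nameTag is not None:
            names.append(nameTag.get_text().strip())
    return names


print(getPlacemarkNames(SAMPLE_KML))
```

Checking for None before calling `get_text()` avoids an AttributeError on placemarks that have no name.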

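A Simple, Safe Crawler

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return None if the page cannot be fetched
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # Return None if the page has no <body> (bsObj.body is None)
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title

# "url" is a placeholder; replace it with the address of the page to fetch
title = getTitle("url")
if title is None:
    print("Title could not be found")
else:
    print(title)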
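The fetch step can be hardened a little further. A string without a scheme, such as the placeholder "url", makes `urlopen` raise ValueError, which is neither HTTPError nor URLError. The sketch below is one possible variant, not the original author's code: `safeRead` is a hypothetical helper, and the 10-second default timeout is an arbitrary assumption.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def safeRead(url, timeout=10):
    """Return the response body as bytes, or None if the URL is
    malformed or the request fails with HTTPError or URLError."""
    try:
        # The with block closes the response once the body has been read
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (HTTPError, URLError, ValueError):
        # ValueError covers strings that have no URL scheme at all
        return None


# A data: URL is handled locally by urllib, so this needs no network access
print(safeRead("data:text/plain,hello"))
print(safeRead("url"))
```

The returned bytes can be passed straight to `BeautifulSoup(..., "html.parser")` in place of the response object.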
