
[Web Scraping] Python Web Scraper | urllib | BeautifulSoup

程序员文章站 (Programmer Article Site) 2022-05-03 20:06:21

Documentation and source code

Open-source scraper code: https://github.com/REMitchell/python-scraping
urllib documentation: https://docs.python.org/3/library/urllib.html
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Required files: http://pan.baidu.com/s/1i55olGL (password: 1985)

Installing BeautifulSoup

BeautifulSoup parses documents you have fetched, in either HTML or XML format.

  1. Download a release
    https://www.crummy.com/software/BeautifulSoup/bs4/download/
  2. Extract it into Python's lib directory
  3. In cmd, change into the beautifulsoup folder and run

    python setup.py install


    If you get the error "You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work":

    1. Extract the bs4 folder into python/lib
    2. Copy python/Tools/scripts/2to3.py into the lib directory as well
    3. In cmd, change to the python/lib folder and run 2to3.py bs4 -w

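Notes:

2to3.py param1 (-w)

param1 is the .py file or the folder to convert (every .py file inside a folder is converted as well).
-w is optional. Without it, the proposed changes are only printed to the screen as a diff; add -w to write the converted code back to the original files.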
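
After the install step it can be useful to confirm that the interpreter you are running actually sees the package. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of BeautifulSoup.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("bs4"):
        import bs4
        print("bs4 version:", bs4.__version__)
    else:
        print("bs4 is not installed for this interpreter")
```

If the check reports that bs4 is missing, the package was probably installed for a different Python installation than the one running the script.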

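Parsing XML documents with BeautifulSoup

Reading the document

from bs4 import BeautifulSoup

# kmlFilePath is the path of the KML file to parse
# Open the file read-only ('r'); the with block closes it automatically
with open(kmlFilePath, "r") as openKmlFile:
    # Read the text of the file
    kmlDom = openKmlFile.read()
# Parse the string and return a BeautifulSoup object
# (html.parser lowercases all tag names)
bsObj = BeautifulSoup(kmlDom, "html.parser")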
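
The read-and-parse step can be sketched end to end as below. This is an illustration under assumptions, not the author's exact code: the sample KML content, the helper name parse_kml_file, and the UTF-8 encoding are all made up for the example. Because html.parser lowercases tag names, the Placemark element has to be looked up as "placemark".

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample KML content, used only for this illustration
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Sample point</name>
    <Point><coordinates>116.39,39.90,0</coordinates></Point>
  </Placemark>
</kml>
"""


def parse_kml_file(path):
    """Read a KML file (assumed UTF-8) and return a BeautifulSoup object."""
    # The with block closes the file even if read() raises
    with open(path, "r", encoding="utf-8") as kml_file:
        text = kml_file.read()
    return BeautifulSoup(text, "html.parser")


# Write the sample to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    kml_path = os.path.join(tmp_dir, "sample.kml")
    with open(kml_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_KML)
    soup = parse_kml_file(kml_path)

placemark = soup.find("placemark")          # lowercase, see note above
placemark_name = soup.find("name").get_text()
print(placemark_name)
```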

Finding tags

bsObj.find_all("body")  # returns every <body> tag in bsObj
bsObj.find("body")  # returns the first <body> tag in bsObj
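
A small sketch of how the two lookups differ, run against made-up markup (SAMPLE_HTML is hypothetical). Note that the tag name is passed without angle brackets, and that a lookup with no match gives None from find but an empty result from find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
SAMPLE_HTML = """
<html><body>
  <h1>Title</h1>
  <p>first</p>
  <p>second</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

paragraphs = soup.find_all("p")       # list-like result with every <p> tag
first_paragraph = soup.find("p")      # only the first <p> tag
missing = soup.find("table")          # None, because there is no <table>
no_matches = soup.find_all("table")   # empty result, not None

print(len(paragraphs))
print(first_paragraph)
print(missing)
```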

Getting the data inside a tag

bsObj.find("body").get_text()
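
Since find returns None when nothing matches, calling get_text directly on its result can raise AttributeError. One way to guard against that is sketched below; the helper name text_of and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def text_of(soup, tag_name):
    """Return the stripped text of the first `tag_name` tag, or None if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    # strip=True trims the whitespace around each piece of text
    return tag.get_text(strip=True)


# Hypothetical markup, used only for this illustration
soup = BeautifulSoup(
    "<body><h1> Hello </h1><p>a <b>bold</b> word</p></body>", "html.parser"
)

heading_text = text_of(soup, "h1")
table_text = text_of(soup, "table")
# get_text() also collects the text of nested tags such as <b>
paragraph_text = soup.find("p").get_text()

print(heading_text)
print(paragraph_text)
```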

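A simple, safe scraper

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return the first <h1> inside <body>, or None if the page cannot be fetched or has no <body>
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The server returned an error or could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        # The page has no <body> tag
        return None
    return title

# "url" is a placeholder; replace it with the address of the page to fetch
title = getTitle("url")
if title is None:
    print("Title could not be found")
else:
    print(title)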
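
One possible variation, offered as a sketch rather than the author's method, is to separate parsing from fetching so the parsing part can be tried on a plain string without network access. The names extract_title and fetch_title and the 10-second timeout are assumptions made for this example. It also catches ValueError, which urlopen raises when the string is not a valid URL at all.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body is None when the markup has no <body> tag
        return None


def fetch_title(url, timeout=10):
    """Download `url` and return its first <h1>, or None on any failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            markup = response.read()
    except (HTTPError, URLError, ValueError):
        # HTTP error, unreachable server, or a string that is not a URL
        return None
    return extract_title(markup)


found = extract_title("<html><body><h1>Hello</h1></body></html>")
no_body = extract_title("<p>no body here</p>")
no_heading = extract_title("<html><body><p>text</p></body></html>")
bad_url = fetch_title("not-a-url")
```

Both "no <body>" and "no <h1>" end up as None here, so the caller only has one case to check.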

Result