Web Scraping: The BeautifulSoup Module (Part 2)

程序员文章站 2022-08-06 21:09:45

Introduction to the BeautifulSoup Module

BeautifulSoup is a Python library for extracting data from HTML or XML files.

Installing BeautifulSoup:

Run pip install beautifulsoup4 in the PyCharm terminal, or open File -> Settings -> Project Interpreter, click the + button, then search for bs4 and add it.
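Once the package is installed, a quick way to see what "extracting data" means is to parse a small inline string. The following is a minimal sketch, not part of the original article: the markup and every value in it are invented for illustration, and only the BeautifulSoup calls come from the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup, invented for this example
html = """
<html><head><title>Demo page</title></head>
<body>
<div id="wrapper" class="box main">
<a class="mnav" href="http://example.com/news"><!--News--></a>
<a class="mnav" href="http://example.com/map">Map</a>
</div>
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

title_text = bs.title.string   # text inside the first <title> tag
div_attrs = bs.div.attrs       # dict of the first <div> tag's attributes
first_link = bs.a.string       # the first <a> contains only an HTML comment
is_comment = isinstance(first_link, Comment)

print(title_text)
print(div_attrs)
print(first_link)   # the comment text, without the <!-- --> markers
print(is_comment)
```

Because .string hands back a Comment object when a tag holds nothing but a comment, checking isinstance(..., Comment) is one way to tell comment text apart from ordinary text.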

Code

from bs4 import BeautifulSoup
import re

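# Read the saved Baidu home page as bytes; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")

# 1. Tag: a tag together with its contents; bs.title returns the first <title> tag found
print(bs.title)

# attrs returns every attribute of the first <div> tag as a dict
# (bs.a.attrs does the same for the first <a> tag)
print(bs.div.attrs)

# .string returns the text inside the first <a> tag; if that text is an
# HTML comment, it is printed without the <!-- --> markers
print(bs.a.string)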

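# Traversing the document
# print(bs.head.contents[1])
# Searching the document
# 1. find_all()
# String filter: matches tags whose name is exactly the given string
# t_list = bs.find_all("a")
# print(t_list)
# Regular-expression filter: matches every tag whose name contains "a"
# t_list = bs.find_all(re.compile("a"))
# Function filter: pass a function that takes a tag and returns True for a match

# def name_is_exists(tag):
#     return tag.has_attr("name")
# t_list = bs.find_all(name_is_exists)
# print(t_list)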

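# 2. kwargs: keyword arguments filter on attributes;
#    class_=True matches every tag that has a class attribute
# t_list = bs.find_all(class_=True)
# for item in t_list:
#     print(item)
# 3. text argument: matches strings in the document equal to any entry in the list
#    (string= is the newer name for this argument)
# t_list = bs.find_all(text=["hao123", "地图"])
# for item in t_list:
#     print(item)
# 4. limit argument: return at most the first 3 matches
# t_list = bs.find_all("a", limit=3)
# for item in t_list:
#     print(item)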
# CSS selectors
# t_list = bs.select('title')  # select by tag name
# t_list = bs.select(".mnav")  # select by class name
# t_list = bs.select("#v1")  # select by id
# t_list = bs.select("a[class='bri']")  # select by attribute
# t_list = bs.select("head > title")  # select a direct child tag
# for item in t_list:
#     print(item)
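The find_all() filters in the listing can also be tried without a saved page. The following is a minimal sketch under stated assumptions: the markup is invented for this example, has_name_attr is a hypothetical helper name, and string= is used as the newer spelling of the text= argument shown above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<html><head><title>Demo page</title><meta name="description" content="demo"></head>
<body>
<a class="mnav" href="/news">news</a>
<a class="mnav" href="/map">map</a>
<a class="bri" href="/more" name="tj_more">more</a>
<span>plain</span>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

# String filter: tags whose name is exactly "a"
by_name = bs.find_all("a")

# Regular-expression filter: tags whose name contains the letter "a"
by_regex = sorted({tag.name for tag in bs.find_all(re.compile("a"))})

# Function filter: tags that carry a name attribute
def has_name_attr(tag):
    return tag.has_attr("name")

by_function = [tag.name for tag in bs.find_all(has_name_attr)]

# Keyword filter: tags that have a class attribute, whatever its value
by_kwargs = bs.find_all(class_=True)

# String search: text nodes equal to one of the listed values
by_text = bs.find_all(string=["news", "map"])

# limit: stop after the first two matches
limited = bs.find_all("a", limit=2)

print(len(by_name), by_regex, by_function)
print(len(by_kwargs), [str(s) for s in by_text], len(limited))
```

Note that the regular-expression filter is applied to tag names, so re.compile("a") also picks up head, meta, and span here, not only the a tags.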

Analysis:

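The code parses a saved copy of the Baidu home page (./baidu.html) and exercises several BeautifulSoup features, in particular traversing the document and the different ways of searching it. The code can be copied as is; to run one feature on its own, uncomment the lines for that feature. Running the features one at a time makes it easier to see what each of them returns.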
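For readers who do not have a saved Baidu page at hand, the CSS selector calls can be tried on an inline string in the same way. This is only a sketch: the markup, the id u1, and the link targets are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<html><head><title>Demo page</title></head>
<body>
<div id="u1">
<a class="mnav" href="/news">news</a>
<a class="mnav" href="/map">map</a>
<a class="bri" href="/more">more</a>
</div>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

by_tag = bs.select("title")              # by tag name
by_class = bs.select(".mnav")            # by class name
by_id = bs.select("#u1")                 # by id
by_attr = bs.select("a[class='bri']")    # by attribute value
by_child = bs.select("head > title")     # direct child of <head>

for item in by_class:
    print(item)
```

select() always returns a list, even for a single match, so an id lookup such as by_id still needs indexing with [0] before its attributes can be read.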

Original article: https://blog.csdn.net/weixin_48106407/article/details/107645086
