II. Web Crawlers: Extraction (1)

程序员文章站 (Programmer Article Site), 2022-04-16 08:20:31

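Getting Started with the Beautiful Soup Library

1. Installing the Beautiful Soup Library
2. Basic Elements of the Beautiful Soup Library
3. Importing the Beautiful Soup Library
4. The BeautifulSoup Class
5. Beautiful Soup Parsers
6. Basic Elements of the BeautifulSoup Class

Tag
Tag name
Tag attrs (attributes)
Tag NavigableString
Tag Comment

7. Traversing HTML Content with the bs4 Library

Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of the tag tree
Summary

8. Formatted HTML Output with the bs4 Library

The prettify() method of the bs4 library
Encoding in the bs4 library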

1. Installing the Beautiful Soup Library

pip install beautifulsoup4

Test

>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text  # demo is reused in the later sections
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>>

Core code:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')
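The test above depends on a live network request. As a minimal sketch, the same parsing step can be tried offline on a local string; DEMO_HTML here is an abbreviated copy of the demo page (the long paragraph text is shortened), which is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated local copy of the demo page, so no network request is needed.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

# Build the tag tree with Python's built-in HTML parser.
soup = BeautifulSoup(DEMO_HTML, "html.parser")

title_text = soup.title.string   # text inside <title>
print(title_text)
```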

2. Basic Elements of the Beautiful Soup Library

3. Importing the Beautiful Soup Library

from bs4 import BeautifulSoup

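The Beautiful Soup library is also known as beautifulsoup4 or bs4. The import shown above is the conventional one, because the BeautifulSoup class is what is mainly used. The whole package can also be imported: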

import bs4

4. The BeautifulSoup Class

5. Beautiful Soup Parsers

soup = BeautifulSoup('<html>data</html>', 'html.parser')
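The second argument names the parser. The sketch below tries a list of parser names in order and falls back when one is not installed. The helper make_soup and the preference order are hypothetical, and "lxml" is an optional third-party parser that the original text does not mention; "html.parser" ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is available."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<html>data</html>")
print(used, soup.get_text())
```

Different parsers may build slightly different trees from the same markup, so it is usually worth naming one explicitly.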

6. Basic Elements of the BeautifulSoup Class

<p class="title">… </p>

Tag

Tag: the most basic unit of information organization, with <> and </> marking its start and end.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in HTML syntax can be accessed as soup.<tag>.
When the HTML document contains several matching <tag> elements, soup.<tag> returns only the first one.
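Since soup.<tag> only returns the first match, a short sketch of the difference may help. It uses find_all(), a standard bs4 method that the text above does not cover, on an abbreviated fragment of the demo page.

```python
from bs4 import BeautifulSoup

# Abbreviated fragment of the demo page with two <a> tags.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                                # first <a> only
all_ids = [a["id"] for a in soup.find_all("a")]    # every <a> in the document
missing = soup.table                               # no <table> present, so None

print(first_link["id"])
print(all_ids)
print(missing)
```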

Tag name

Name: the name of a tag; the name of <p>…</p> is 'p'. Format: <tag>.name

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>>

Tag attrs (attributes)

>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>>

A <tag> can have zero or more attributes; they are stored as a dictionary.
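Because attrs is an ordinary dictionary, the usual dictionary idioms apply. The sketch below, on a made-up fragment (the second class value "extra" and the <b> tag are assumptions for illustration), shows a tag with no attributes, a multi-valued class, and a safe lookup for an attribute that may be absent.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1 extra" id="link1">Basic Python</a><b>bold</b></p>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

classes = tag["class"]          # class is multi-valued, so bs4 returns a list
href = tag["href"]              # tag[...] is shorthand for tag.attrs[...]
title = tag.get("title")        # .get() returns None instead of raising KeyError
has_id = tag.has_attr("id")
no_attrs = soup.b.attrs         # a tag without attributes has an empty dict

print(classes, href, title, has_id, no_attrs)
```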

Tag NavigableString

NavigableString: the non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>>

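A NavigableString can span multiple levels: soup.p.string above returns the text nested inside <b>, because <b> is the only child of <p>.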
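This only works while each level has a single child. As a small sketch on a made-up second paragraph (an assumption for illustration), .string returns None once a tag holds more than one child, and get_text(), a standard bs4 method not covered above, joins all nested strings instead.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Basic <a>Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

nested = first_p.string          # reaches through the single <b> child
ambiguous = second_p.string      # two children (text and <a>), so None
joined = second_p.get_text()     # concatenates every string below the tag

print(nested)
print(ambiguous)
print(joined)
```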

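Tag Comment

Comment: the comment part of the string inside a tag, a special type of NavigableString (bs4.element.Comment). Note in the session below that .string drops the <!-- --> markers, so the type must be checked to tell a comment from ordinary text.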

>>> newsoup = BeautifulSoup('<b><!--this is a comment--></b><p>this is a comment</p>','html.parser')
>>> newsoup.b.string
'this is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'this is a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
>>>
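Since a comment and ordinary text look identical through .string, a type check is needed when only visible text is wanted. A minimal sketch, with visible_strings as a hypothetical helper name:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    '<b><!--this is a comment--></b><p>this is a comment</p>', 'html.parser'
)

def visible_strings(soup):
    """Hypothetical helper: every string in the tree that is not a comment."""
    return [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]

b_is_comment = isinstance(newsoup.b.string, Comment)
p_is_comment = isinstance(newsoup.p.string, Comment)
texts = visible_strings(newsoup)

print(b_is_comment, p_is_comment, texts)
```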

7. Traversing HTML Content with the bs4 Library

>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>>

Basic HTML structure

<html>
	<head>
		<title>This is a python demo page</title>
	</head>
	<body>
		<p class="title"><b>The demo python introduces several python courses.</b></p>
		<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
			<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
			<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
		</p>
	</body>
</html>

Downward traversal of the tag tree

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>>

Iterating over child nodes:

for child in soup.body.children:
	print(child)

Iterating over descendant nodes:

for child in soup.body.descendants:
	print(child)
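To make the difference between the two loops concrete, here is a minimal sketch on a reduced version of the demo body (the shortened text is an assumption for illustration). It keeps only Tag nodes, because both iterators also yield plain strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body><p class="title"><b>Title text</b></p>'
    '<p class="course"><a id="link1">Basic Python</a></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

# .children yields direct children only.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```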

Upward traversal of the tag tree

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>>

>>> soup = BeautifulSoup(demo,'html.parser')
>>> for parent in soup.a.parents:
	if parent is None:
		print(parent)
	else:
		print(parent.name)

p
body
html
[document]
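The same upward walk can be collected into a list, which is handy for building a path to a node. A minimal sketch on a reduced document (an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents yields every ancestor, ending with the BeautifulSoup object itself,
# whose name is '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent

print(ancestor_names)
print(root_parent)
```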

Sibling traversal of the tag tree

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.a.nextSibling
' and '
>>> soup.a.nextSibling.nextSibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previousSibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previousSibling.previousSibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>>

Iterating over following siblings

for sibling in soup.a.next_siblings:
	print(sibling)

Iterating over preceding siblings

for sibling in soup.a.previous_siblings:
	print(sibling)
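Siblings include plain strings as well as tags, as the ' and ' result above shows. The sketch below, on an abbreviated course paragraph (an assumption for illustration), lists both kinds and then uses find_next_sibling(), a standard bs4 method not covered above, to jump straight to the next tag.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

following = list(soup.a.next_siblings)       # ' and ', <a id="link2">, '.'
preceding = list(soup.a.previous_siblings)   # 'Courses: '

# Skip the string nodes and go directly to the next <a> sibling.
next_link = soup.a.find_next_sibling("a")

print(following)
print(preceding)
print(next_link["id"])
```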

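Summary

The tag tree can be traversed downward (.contents, .children, .descendants), upward (.parent, .parents), and sideways (.next_sibling, .previous_sibling, .next_siblings, .previous_siblings).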

8. Formatted HTML Output with the bs4 Library

Can HTML content be displayed in a more "friendly" way?

The prettify() method of the bs4 library

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

>>>
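One point worth keeping in mind is that prettify() adds newlines and indentation inside tags, so its output is meant for reading rather than for re-parsing. A minimal sketch of that effect on the title paragraph of the demo page:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()
# One tag or string per line; strip the indentation to compare content only.
pretty_lines = [line.strip() for line in pretty.splitlines()]

# Re-parsing the prettified text keeps the added whitespace inside the strings.
reparsed = BeautifulSoup(pretty, "html.parser")
same_string = reparsed.b.string == soup.b.string
same_after_strip = reparsed.b.string.strip() == soup.b.string

print(pretty)
print(same_string, same_after_strip)
```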

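Encoding in the bs4 library

Python 3 uses UTF-8 by default, so non-ASCII text such as Chinese parses without extra handling:

>>> soup1 = BeautifulSoup('<p>中文</p>','html.parser')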
>>> soup1.p.string
'中文'
>>> print(soup1.p.prettify())
<p>
 中文
</p>
>>>
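The session above starts from a Python string, which is already Unicode. When the input is raw bytes in another encoding, the encoding can be passed explicitly. The sketch below assumes a GBK-encoded input purely for illustration and uses the from_encoding argument and Tag.encode(), both standard bs4 features not covered above.

```python
from bs4 import BeautifulSoup

# Assumed input: bytes in GBK rather than UTF-8.
raw = '<p>中文</p>'.encode('gbk')

# from_encoding tells bs4 how to decode the bytes into Unicode.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
text = soup.p.string

# On output, a tag can be serialized back to bytes in a chosen encoding.
utf8_bytes = soup.p.encode('utf-8')

print(text)
print(utf8_bytes)
```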

Original article: https://blog.csdn.net/HolllllldOn/article/details/107361170
