Python Web Scraping, Chapter 3: Crawling Wikipedia (1)

程序员文章站 2022-05-07 23:08:28

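3.1 Traversing a Single Domain

Goal: collect the links to all other articles that appear on the Wikipedia page for Kevin Bacon.

3.1.1 Crawling an Arbitrary Wikipedia Page

Example code: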

from urllib.request import urlopen
from bs4 import BeautifulSoup


html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
for link in soup.find_all('a'):  # every link on the page is an <a> tag
    if 'href' in link.attrs:
        print(link.attrs['href'])  # the href attribute holds the link target

Output:

....
/wiki/Michael_Douglas
/wiki/Miguel_Ferrer
/wiki/Albert_Finney
/wiki/Topher_Grace
....
https://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
https://wikimediafoundation.org/wiki/Cookie_statement
//en.m.wikipedia.org/w/index.php?title=Kevin_Bacon&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/
[Finished in 4.7s]

The output contains every link on the page, including some we do not need, for example:

title=Kevin_Bacon&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/
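
Since the goal of this section is to stay within a single domain, one way to see why such links are unwanted is to resolve each href against the page URL and look at the host it points to. The sketch below uses only urllib.parse from the standard library; BASE_URL and the helper name resolve are assumptions made for illustration and are not part of the original scripts.

```python
from urllib.parse import urljoin, urlparse

# Page the hrefs were scraped from (same URL as in the script above).
BASE_URL = 'http://en.wikipedia.org/wiki/Kevin_Bacon'


def resolve(href, base=BASE_URL):
    """Turn a relative or protocol-relative href into an absolute URL."""
    return urljoin(base, href)


hrefs = [
    '/wiki/Michael_Douglas',             # relative path on the same host
    '//www.mediawiki.org/',              # protocol-relative, different host
    'https://wikimediafoundation.org/',  # absolute, different host
]

for href in hrefs:
    absolute = resolve(href)
    same_host = urlparse(absolute).netloc == urlparse(BASE_URL).netloc
    print(absolute, same_host)
```

Only the first href resolves to en.wikipedia.org; the other two lead to other hosts.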

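Inspecting the page structure with the browser's inspector shows that links to article pages share three traits:
1. They sit inside the div whose id is bodyContent.
2. Their URLs do not contain a colon (":").
3. Their URLs begin with "/wiki/".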

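The first trait can be expressed as a BeautifulSoup lookup, and the other two as regular expressions. Pattern 2 only rules out colons when it is anchored at the start of the string, which is what the combined pattern in the improved code does.
1. soup.find('div', {'id': 'bodyContent'})
2. regex = re.compile(r'((?!:).)*$')  # (?!:) is a negative lookahead: the next character must not be a colon
3. regex = re.compile(r'^(/wiki/)')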
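
Before running the patterns against the live page, they can be combined and tried on a handful of hrefs taken from the output of the first script. This is a minimal sketch; the names ARTICLE_LINK and is_article_link are hypothetical and used only for illustration.

```python
import re

# Combined pattern: starts with /wiki/ and has no colon anywhere after it.
# ARTICLE_LINK and is_article_link are illustrative names, not from the original code.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')


def is_article_link(href):
    """Return True if href looks like a link to a Wikipedia article."""
    return ARTICLE_LINK.match(href) is not None


# Sample hrefs copied from the output of the first script.
samples = [
    '/wiki/Michael_Douglas',
    '/wiki/Wikipedia:About',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    'https://wikimediafoundation.org/',
]

for href in samples:
    print(href, is_article_link(href))
```

Only the first sample passes: the second contains a colon, and the last two do not start with /wiki/.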

The improved code is therefore:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


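html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
soup = BeautifulSoup(html, 'lxml')
# Starts with /wiki/ and contains no colon anywhere after it.
regex = re.compile(r"^(/wiki/)((?!:).)*$")
# Search only inside the article body, and keep only <a> tags whose href matches.
for link in soup.find('div', {'id': 'bodyContent'}).find_all('a', href=regex):
    if 'href' in link.attrs:
        print(link.attrs['href'])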

Output:

/wiki/Kevin_Bacon_(disambiguation)
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
....
/wiki/International_Standard_Name_Identifier
/wiki/Integrated_Authority_File
/wiki/Syst%C3%A8me_universitaire_de_documentation
/wiki/Biblioth%C3%A8que_nationale_de_France
/wiki/MusicBrainz
/wiki/Biblioteca_Nacional_de_Espa%C3%B1a
/wiki/SNAC
[Finished in 8.4s]
