Python Web Crawler, Chapter 3: Crawling * (2)

Posted on 程序员文章站, 2022-05-07 23:09:22

3.1.2 Randomly following article links on a page

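Goal: take a random walk. Starting from one page, jump to a randomly chosen link on that page, then repeat from the new page.
Sample code: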

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re


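# The same seed always produces the same random sequence; a different seed produces a different one.
# Seeding with the current timestamp makes every run differ.
# The datetime is converted to a float because Python 3.11+ rejects a datetime object as a seed.
random.seed(datetime.datetime.now().timestamp())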


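def get_links(article_url):
    # Download the article and return every <a> tag inside the body whose href
    # is an internal article link: it starts with /wiki/ and contains no colon.
    html = urlopen("http://en.wikipedia.org" + article_url)
    soup = BeautifulSoup(html, 'lxml')
    regex = re.compile(r"^(/wiki/)((?!:).)*$")
    return soup.find('div', {'id': 'bodyContent'}).find_all('a', href=regex)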


links = get_links("/wiki/Kevin_Bacon")
count = 0  # number of links followed so far

while len(links) > 0:
    new_article = random.choice(links).attrs['href']  # pick one article link at random
    print(new_article)  # print the path of that link
    links = get_links(new_article)  # fetch all article links on the new page, then repeat
    count += 1

print(count)
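
To see what the regular expression in get_links keeps and what it discards, it can be tried offline on a few hrefs. This is only a small sketch: the sample hrefs and the helper name is_article_link are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as in get_links: the path must start with /wiki/
# and must not contain a colon anywhere after that prefix.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical sample hrefs, chosen only to illustrate the filter.
sample_hrefs = [
    "/wiki/Kevin_Bacon",               # ordinary article
    "/wiki/Category:American_actors",  # namespace page, contains a colon
    "/wiki/File:Example.jpg",          # file page, contains a colon
    "#cite_note-1",                    # in-page anchor
    "https://www.example.com/",        # external site
]


def is_article_link(href):
    """Return True if href looks like an internal article link."""
    return ARTICLE_LINK.match(href) is not None


article_links = [href for href in sample_hrefs if is_article_link(href)]
print(article_links)
```

Only the first sample passes the filter. The colon check is what excludes namespace pages such as Category: and File:, so the walk stays on ordinary articles.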

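The output is random on every run, so everyone's results will differ. The loop only stops when it reaches a page with no article links, and the code has neither exception handling nor any way of dealing with anti-crawler mechanisms, so sooner or later it is bound to fail with an error: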

....
urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
[Finished in 501.6s]
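
One possible way to keep a failure like this from killing the whole run is to cap the number of steps, pause between requests, and retry a failed fetch a few times. The following is a minimal sketch under those assumptions, not the original author's method. The names random_walk, fetch_hrefs, max_steps, delay_seconds and max_retries are hypothetical, and the demo at the end uses a made-up link graph so that nothing is downloaded.

```python
import random
import time


def random_walk(start_path, fetch_hrefs, max_steps=20, delay_seconds=1.0, max_retries=2):
    """Follow randomly chosen links and return the list of paths visited.

    fetch_hrefs is any callable that takes a path and returns a list of
    href strings found on that page.
    """
    visited = []
    current = start_path
    for _ in range(max_steps):
        hrefs = None
        for attempt in range(max_retries + 1):
            try:
                hrefs = fetch_hrefs(current)
                break
            except OSError as err:
                # urllib.error.URLError and socket timeouts are subclasses of OSError.
                print(f"fetch failed for {current} (attempt {attempt + 1}): {err}")
                time.sleep(delay_seconds)
        if not hrefs:
            # Every attempt failed, or the page has no article links: stop the walk.
            break
        current = random.choice(hrefs)
        visited.append(current)
        print(current)
        time.sleep(delay_seconds)  # pause so requests are not sent back to back
    return visited


# Offline demo on a hypothetical link graph.
fake_pages = {
    "/wiki/A": ["/wiki/B"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],
}
path = random_walk("/wiki/A", lambda p: fake_pages[p], delay_seconds=0)
```

To use it with the get_links function above, the tags would first have to be turned into strings, for example with lambda p: [a.attrs['href'] for a in get_links(p)]. Passing a timeout to urlopen, as in urlopen(url, timeout=10), would also keep a stalled connection from hanging for minutes before the error is raised.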
