Scraping the Baidu Homepage with Python 3

程序员文章站 2022-07-12 22:13:45

Reading the Baidu Homepage with Python 3

Code

Fetching the Baidu homepage

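import urllib.request

url = "http://www.baidu.com/"
# urlopen returns a response object whose read() method yields raw bytes
html = urllib.request.urlopen(url)
# Decode the bytes into a str; this assumes the page is served as UTF-8
content = html.read().decode('utf-8')
# Alternative to the line above: content = bytes.decode(html.read())
print(content)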
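The snippet above hard-codes UTF-8. As a minimal sketch, assuming the server declares its encoding in the Content-Type header, the charset can be read from the response instead. The helper names `charset_from_content_type` and `fetch_text`, the timeout, and the User-Agent value are illustrative choices, not part of the original article.

```python
import urllib.request
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default


def fetch_text(url, timeout=10):
    """Fetch url and decode the body with the charset the server declares."""
    # The User-Agent value is an illustrative placeholder
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        charset = charset_from_content_type(response.headers.get("Content-Type", ""))
    # errors="replace" keeps going if a few bytes do not match the charset
    return raw.decode(charset, errors="replace")


# Offline demonstration of the header parsing
print(charset_from_content_type("text/html; charset=GBK"))
# Needs network access:
# print(fetch_text("http://www.baidu.com/")[:200])
```

Using the response as a context manager also closes the connection once the body has been read.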

Reading the title of the Baidu homepage

Run pip install bs4 in a terminal to install BeautifulSoup.

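from urllib.request import urlopen
from bs4 import BeautifulSoup as bf

html = urlopen("http://www.baidu.com/")
# Parse the response body with Python's built-in HTML parser
obj = bf(html.read(), 'html.parser')
# Prints the whole <title> tag; obj.head.title.get_text() returns only its text
print(obj.head.title)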
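The 'html.parser' argument above names Python's built-in parser. To show roughly what locating the title involves, here is a standard-library sketch that uses that parser directly. It is not how BeautifulSoup is implemented; the `TitleParser` class, the `extract_title` helper, and the sample HTML are hypothetical and run offline.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped text of the first <title> in html, or ''."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical stand-in for a downloaded page
SAMPLE = "<html><head><title> Sample page </title></head><body></body></html>"
print(extract_title(SAMPLE))
```

BeautifulSoup remains the more convenient choice in practice, since it builds a full tree that can be queried repeatedly.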

Extracting the Baidu logo

Getting information on all images

from urllib.request import urlopen
from bs4 import BeautifulSoup as bf

html = urlopen("http://www.baidu.com/")
obj = bf(html.read(), 'html.parser')
# find_all returns a list of every <img> tag in the document
pic_info = obj.find_all('img')
# Print each image tag in turn
for i in pic_info:
    print(i)

Running this prints every img tag on the page, including all of its attributes.
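Often only one attribute of each image is needed rather than the whole tag. The following sketch, which assumes bs4 is installed, pulls out the src values. The sample markup and the `image_sources` helper are hypothetical stand-ins, not Baidu's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not Baidu's real markup
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <img class="index-logo-src" src="//example.com/logo.png" width="270">
  <img src="/static/banner.gif" alt="banner">
  <img alt="no source attribute">
</body></html>
"""


def image_sources(html):
    """Return the src of every <img> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get returns None for a missing attribute instead of raising KeyError
    return [img.get("src") for img in soup.find_all("img") if img.get("src")]


for src in image_sources(SAMPLE_HTML):
    print(src)
```

Indexing with `img['src']` would raise KeyError on the third tag, which is why the sketch uses `get`.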

Getting the URL of the logo image

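from urllib.request import urlopen
from bs4 import BeautifulSoup as bf

html = urlopen("http://www.baidu.com/")
obj = bf(html.read(), 'html.parser')
# Keep only <img> tags whose class is "index-logo-src"
logo_pic_info = obj.find_all('img', class_="index-logo-src")
# The src is assumed to be protocol-relative (starting with "//"), so prepend the scheme.
# This raises IndexError if the page has no image with that class.
logo_url = "http:" + logo_pic_info[0]['src']
print(logo_url)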

Running this prints the URL of the logo image.
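Prepending "http:" only works when the src is protocol-relative. As a sketch of a more general approach, `urllib.parse.urljoin` resolves any src against the URL of the page. The src values and the `absolute_url` helper below are hypothetical examples.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.baidu.com/"


def absolute_url(src, base=PAGE_URL):
    """Resolve an img src against the URL of the page it came from."""
    return urljoin(base, src)


# Hypothetical src values, for illustration only
print(absolute_url("//example.com/img/logo.png"))    # protocol-relative
print(absolute_url("/img/logo.png"))                 # path on the same host
print(absolute_url("https://example.com/logo.png"))  # already absolute
```

A protocol-relative src inherits the scheme of the page, a path is resolved on the page's host, and an absolute URL is returned unchanged.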

Downloading the logo file from its URL

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup as bf

html = urlopen("http://www.baidu.com/")
obj = bf(html.read(), 'html.parser')
logo_pic_info = obj.find_all('img', class_="index-logo-src")
logo_url = "http:" + logo_pic_info[0]['src']
# Download the image and save it as logo.png in the current directory
urlretrieve(logo_url, 'logo.png')

The logo image is downloaded successfully and saved as logo.png.
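The Python documentation lists urlretrieve under the legacy interface of urllib.request. One possible alternative, sketched here with a hypothetical `download` helper and an illustrative timeout, streams the response into a file with urlopen and shutil.copyfileobj.

```python
import shutil
import urllib.request


def download(url, dest, timeout=10):
    """Stream the resource at url into the file dest and return dest."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        # copyfileobj copies in chunks, so the file is never fully held in memory
        shutil.copyfileobj(response, out)
    return dest


# Needs network access; logo_url is the value computed in the article's code:
# download(logo_url, "logo.png")
```

Both context managers close their resources even if the copy fails partway through.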

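Interestingly, Baidu's logo had been redesigned because of the pandemic, as a tribute to all frontline workers.

Summary

This article presented code for three tasks in Python 3: fetching the Baidu homepage, reading the page title, and extracting the page logo. The functions used are: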

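request.urlopen in urllib reads the content of a web page.
bytes.decode converts the raw bytes of the page into a string.
bs4 parses the page content into a structured tree that is easy to query.
The find_all method of BeautifulSoup extracts the information held in image tags.
request.urlretrieve in urllib downloads the content at a URL and saves it to a file.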
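The direction of bytes.decode is easy to mix up: encode turns a string into bytes, and decode turns bytes back into a string. A short offline illustration, using an arbitrary sample string:

```python
text = "百度一下，你就知道"

# str -> bytes
raw = text.encode("utf-8")
# bytes -> str; bytes.decode(raw, "utf-8") is the same call written on the class
decoded = raw.decode("utf-8")

print(type(raw).__name__, type(decoded).__name__)
print(decoded == text)
```

The response returned by urlopen yields bytes from read(), which is why the article's first snippet calls decode before printing.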

