Using urllib for Web Scraping

程序员文章站 2022-05-03 20:05:33

Using request and parse

from urllib import request

# Example: fetch the Baidu home page.
# First, request https://www.baidu.com/ directly.
html_obj = request.urlopen("https://www.baidu.com/")
# Read the response body and decode it as UTF-8.
html_content = html_obj.read().decode("utf-8")
print(html_content)
# The response contains almost no content, so we need to pose as a browser
# by adding a User-Agent (browser identifier) to the request headers.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
}
# Wrap the URL and headers in a Request object.
url = "https://www.baidu.com/"
req = request.Request(url=url, headers=headers)
# Send the request again.
html_second = request.urlopen(req)
html_content_second = html_second.read().decode("utf-8")
print(html_content_second)
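
To see what the Request object above will actually send, it can be inspected without touching the network. The following is a minimal sketch and not part of the original script: build_request and fetch are hypothetical helper names, and the 10-second timeout is an arbitrary assumption. fetch also adds basic error handling, which the script above leaves out.

```python
from urllib import error, request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
)


def build_request(url):
    """Wrap a URL in a Request that carries a browser-like User-Agent."""
    return request.Request(url=url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Return the decoded page body, or None if the request fails.

    Hypothetical helper; the timeout value is an assumption.
    """
    try:
        with request.urlopen(build_request(url), timeout=timeout) as resp:
            # Prefer the charset declared by the server, fall back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except error.HTTPError as exc:   # the server answered with an error status
        print("HTTP error:", exc.code)
    except error.URLError as exc:    # e.g. DNS failure or refused connection
        print("Connection problem:", exc.reason)
    return None


# Inspecting the request does not send anything over the network.
req = build_request("https://www.baidu.com/")
print(req.full_url)                  # https://www.baidu.com/
print(req.get_method())              # GET (no data was attached)
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Calling fetch("https://www.baidu.com/") would then perform the real request. HTTPError is caught before URLError because it is a subclass of it.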

Submitting a query and scraping the result

from urllib import request, parse

base_url = "https://tieba.baidu.com/f?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
}

kw = input("Enter a keyword: ")

# The data to submit.
data = {
    "ie": "utf-8",
    "kw": kw,
    "fr": "search",
}

# Use parse.urlencode() to convert the data into a query string.
data_str = parse.urlencode(data)
# This is a GET request, so the query string is appended directly to the URL.
url = base_url + data_str

req = request.Request(url=url, headers=headers)
html = request.urlopen(req).read().decode("utf-8")
# Build the file name.
file_name = "%s.html" % kw
# Write the HTML to the file.
with open(file_name, "w", encoding="utf-8") as f:
    f.write(html)
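
To make the encoding step above concrete, here is a small offline sketch of what parse.urlencode() produces and how the resulting query string can be decoded again. build_search_url and safe_file_name are hypothetical helpers introduced for illustration. The character filter in safe_file_name is an assumption about what a safe file name looks like, added because the script above uses raw user input as a file name.

```python
import re
from urllib import parse

BASE_URL = "https://tieba.baidu.com/f?"


def build_search_url(kw):
    """Append the URL-encoded GET parameters to the base URL."""
    params = {"ie": "utf-8", "kw": kw, "fr": "search"}
    return BASE_URL + parse.urlencode(params)


def safe_file_name(kw):
    """Replace characters that are invalid or risky in file names.

    Hypothetical helper; the filtered character set is an assumption.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", kw).strip("_.")
    return (cleaned or "result") + ".html"


print(build_search_url("python"))
# https://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search

# Non-ASCII keywords are percent-encoded as UTF-8 bytes.
url = build_search_url("爬虫")
print(url)

# parse_qs reverses the encoding, so the original keyword comes back intact.
query = parse.urlsplit(url).query
print(parse.parse_qs(query)["kw"])   # ['爬虫']

# Path separators and spaces are replaced before the name is used on disk.
print(safe_file_name("../a b"))      # a_b.html
```

Because urlencode() handles the escaping, the keyword never needs to be quoted by hand before it is appended to the base URL.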

That covers the basics of using urllib. More updates will follow!
