Python Web Scraping - Using urllib

程序员文章站 2022-05-03 20:05:03


urlopen()

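The urllib.request module provides the most basic way to construct HTTP requests. It can simulate the process of a browser issuing a request, and it also handles authentication, redirects, cookies, and more. Let's open the test site httpbin.org.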

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')
print(response.read().decode('utf-8'))

The output is as follows:

{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "origin": "116.227.107.42", 
  "url": "https://httpbin.org/get"
}
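Since httpbin.org answers with JSON, the body can be parsed into a dictionary instead of being printed as text. The following is a minimal sketch rather than part of the original example: the helper name fetch_json and the 10-second timeout are assumptions chosen for illustration. It uses a with block so that the response is closed automatically.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Fetch url and parse the body as JSON (fetch_json is a hypothetical helper)."""
    # The with block closes the response even if decoding fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in Content-Type, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))
```

For instance, fetch_json('https://httpbin.org/get')['url'] should give back the URL that was requested, matching the "url" field in the output above.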

Let's see what it actually returns:

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')
print(type(response))

The output is as follows:

<class 'http.client.HTTPResponse'>

This is an HTTPResponse object. It mainly provides the methods read([amt]), readinto(b), getheader(name, default=None), getheaders(), and fileno(), along with the attributes msg, version, status, reason, debuglevel, and closed.

Once this object is assigned to the variable response, we can call these methods and read these attributes to get information about the result.

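For example, calling read() returns the content of the page, and the status attribute holds the status code of the result: 200 means the request succeeded, 404 means the page was not found, and so on.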
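The optional amt argument of read([amt]) limits how many bytes a single call returns, which is useful when a body is too large to hold in memory at once. A rough sketch of reading in pieces, assuming a hypothetical generator named iter_chunks and an arbitrary default chunk size of 1024 bytes:

```python
import urllib.request


def iter_chunks(url, chunk_size=1024, timeout=10):
    """Yield the response body in pieces of at most chunk_size bytes.

    iter_chunks is a hypothetical helper, not part of urllib.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes means the body is exhausted
                break
            yield chunk
```

Joining the pieces with b"".join(iter_chunks(url)) should produce the same bytes as a single read() call.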

Here is an example that prints the status code and the response headers:

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')
print(response.status)
print(response.getheaders())

The output is as follows:

200
[('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Date', 'Thu, 31 Jan 2019 12:13:29 GMT'), ('Content-Type', 'application/json'), ('Content-Length', '236'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true'), ('Via', '1.1 vegur')]
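getheaders() returns a list of (name, value) tuples, while getheader(name, default=None) looks up a single header and returns the default when that header is absent. The sketch below collects a few of these values into a dictionary. The function name describe_response and the header X-Not-There are made up for illustration; the latter is simply a header the server is not expected to send.

```python
import urllib.request


def describe_response(url, timeout=10):
    """Summarize an HTTP response (describe_response is a hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return {
            "status": response.status,   # e.g. 200
            "reason": response.reason,   # e.g. "OK"
            # Look up one header by name.
            "content_type": response.getheader("Content-Type"),
            # A header that is absent falls back to the given default.
            "x_missing": response.getheader("X-Not-There", "n/a"),
            # Turn the list of (name, value) tuples into a dict.
            "headers": dict(response.getheaders()),
        }
```

Note that dict() keeps only the last value when a header such as Set-Cookie appears more than once, so the list form is safer for repeated headers.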

With the basic urlopen() method, we can fetch simple web pages with a GET request.
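A GET request does not always succeed. urlopen() raises urllib.error.HTTPError when the server answers with an error status such as 404, and urllib.error.URLError when no response could be obtained at all. The following is one possible way to wrap this, assuming a hypothetical helper named safe_get, a 10-second timeout, and a UTF-8 body:

```python
import urllib.error
import urllib.request


def safe_get(url, timeout=10):
    """Return (status, text); text is None on failure (safe_get is hypothetical)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumes the body is UTF-8 text, as in the examples above.
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The server replied, but with an error status such as 404 or 500.
        return err.code, None
    except urllib.error.URLError:
        # No HTTP response at all: DNS failure, refused connection, bad scheme.
        return None, None
```

HTTPError is a subclass of URLError, so it has to be caught first. Otherwise the more general handler would swallow the status code.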
