Web Scraping Basics: Using the urllib Library (Fetching Web Page Content)

程序员文章站 2022-03-02 20:26:19

This article covers three topics:
- The request module, urllib.request
- The parsing module, urllib.parse
- The exception-handling module, urllib.error

The request module: urllib.request

1. urllib.request.urlopen
urllib.request.urlopen(url, data=None, timeout=None)
The three most commonly used parameters are url, data, and timeout.

The url parameter

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# read() returns the response body as bytes; decode it to get text
print(response.read().decode('utf-8'))

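The data parameter

- GET requests do not need the data parameter (the query data can be appended to the end of the URL).
- POST requests pass their payload through data, which must be URL-encoded and converted to bytes first.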

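import urllib.request
import urllib.parse

# URL-encode the form fields, then convert the resulting string to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# The target URL must accept POST data; httpbin.org/post echoes it back
res = urllib.request.urlopen('http://httpbin.org/post', data=data)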
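
To see what the encoding step produces without sending anything over the network, the sketch below builds the same payload and wraps it in a Request object. The variable names are illustrative. It shows that urlencode returns a str, that data has to be bytes, and that a request carrying data is treated as a POST when no method is given.

```python
import urllib.parse
import urllib.request

form = {'word': 'hello'}

# urlencode returns a str such as 'word=hello'
encoded = urllib.parse.urlencode(form)

# urlopen and Request expect data as bytes, so encode the string
data = encoded.encode('utf-8')

# No request is sent here; the Request object only describes it
req = urllib.request.Request('http://httpbin.org/post', data=data)

print(type(encoded).__name__, encoded)
print(type(data).__name__, data)
print(req.get_method())  # POST, because data is not None
```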

The timeout parameter

Purpose: when the network is unstable or the server misbehaves, setting a timeout keeps the program from waiting indefinitely.

import urllib.request

# timeout is in seconds; if it expires, an exception is raised:
# socket.timeout, which may arrive wrapped in urllib.error.URLError
res = urllib.request.urlopen('http://www.baidu.com', timeout=3)
print(res.read())
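
One possible way to handle a timeout is sketched below. The helper name fetch is hypothetical, not part of urllib. It assumes that a timeout during connection shows up as a URLError whose reason is a socket.timeout, while a timeout during read() is raised directly.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except urllib.error.URLError as e:
        # Connection-stage problems are wrapped in URLError; e.reason holds the cause
        if isinstance(e.reason, socket.timeout):
            print('timed out while connecting')
        else:
            print('request failed:', e.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly
        print('timed out while reading')
    return None
```

Calling fetch('http://www.baidu.com', timeout=3) would return the page bytes on success and None otherwise, so the caller can decide whether to retry.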

Notes:
res.getheaders(): returns the response headers
res.status: the HTTP status code
res.read(): returns the response body (as bytes)
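
The three members above can be tried without touching a real site. The sketch below starts a throwaway HTTP server on the local machine (HelloHandler is a hypothetical handler written for this example) and inspects the response it gets back. The empty ProxyHandler is an assumption added so that a system proxy does not intercept the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = 'hello'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler makes the local request bypass any system proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=3) as res:
    status = res.status                 # integer status code
    headers = dict(res.getheaders())    # list of (name, value) pairs
    raw = res.read()                    # body as bytes

server.shutdown()
server.server_close()

print(status)
print(headers['Content-Type'])
print(raw, raw.decode('utf-8'))
```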

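2. urllib.request.Request
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

data: same as above

The headers parameter

Purpose: setting request headers such as User-Agent is a simple way to get past basic anti-scraping checks.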

# Method 1: pass the headers as a dictionary
import urllib.request
import urllib.parse

headers = {'User-Agent': 'xxxxx'}
# For a POST request, the target URL must be able to accept the submitted data
url = 'http://httpbin.org/post'
dic = {'name': 'xiaoming'}
data = bytes(urllib.parse.urlencode(dic), encoding='utf-8')  # encode the data
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')  # build the request
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

# Method 2: use the add_header method
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
d = {'name': 'xiaoming'}
data = bytes(urllib.parse.urlencode(d), encoding='utf-8')  # encode the data
req = urllib.request.Request(url=url, data=data, method='POST')
# add_header sets one header per call; to set several, define a dict of headers and loop over it
req.add_header('User-Agent', 'xxxxx')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))
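
The loop mentioned for Method 2 might look like the sketch below. The header values are made up for illustration, and nothing is sent: the code only builds the Request and inspects what it stored.

```python
import urllib.request

# Hypothetical header values, for illustration only
custom_headers = {
    'User-Agent': 'my-crawler/0.1',
    'Accept-Language': 'en-US,en;q=0.8',
}

req = urllib.request.Request('http://httpbin.org/get')
for name, value in custom_headers.items():
    req.add_header(name, value)

# Request stores header names in capitalized form, e.g. 'User-agent'
print(req.header_items())
print(req.get_header('User-agent'))
```

Because of that capitalization, lookups with get_header or has_header need the stored form ('User-agent'), not the spelling used when the header was added.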

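3. urllib.request.ProxyHandler

Purpose: route requests through proxy IPs so that the program's requests appear to come from different IP addresses (one way of dealing with anti-scraping measures).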

import urllib.request

# Create the proxy handler (xxxx:port stands for a real proxy address)
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://xxxx:port',
    'https': 'https://xxxx:port',
})
# Create an opener that uses the proxy handler
opener = urllib.request.build_opener(proxy_handler)
# Open the URL through the proxy
res = opener.open('http://httpbin.org/get')
print(res.read())
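
To make requests appear to come from different addresses, one option is to pick a proxy per request. The sketch below is only an outline of that idea: the PROXIES list uses addresses from a documentation-only IP range and the helper build_proxy_opener is hypothetical. No traffic is sent until open() is called on the returned opener.

```python
import random
import urllib.request

# Hypothetical proxy addresses from a documentation-only IP range;
# replace them with proxies you are allowed to use
PROXIES = [
    'http://192.0.2.10:8080',
    'http://192.0.2.11:8080',
]


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


# Pick a proxy at random for this request
opener = build_proxy_opener(random.choice(PROXIES))
```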

4. urllib.request.HTTPCookieProcessor

Purpose: cookies hold login information; this handler sends the stored cookies along with each request (http.cookiejar is used to capture and store them).

Getting cookies

import http.cookiejar
import urllib.request

# Create a CookieJar instance
cookie = http.cookiejar.CookieJar()
# Create the cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Create an opener that uses the handler
opener = urllib.request.build_opener(handler)
# Open the URL with the opener
res = opener.open('http://www.baidu.com')
# Print the cookies set by the server
for item in cookie:
    print(item.name + '=' + item.value)

Saving cookies

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
# MozillaCookieJar is a CookieJar subclass that can write cookies to a file
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
# Save to the file, including session cookies and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

Loading cookies from a file

import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
print(res.read().decode('utf-8'))
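
The effect of ignore_discard can be checked offline. In the sketch below a session cookie is built by hand (normally the server sets it; make_session_cookie and its values are hypothetical) and saved twice to a temporary file, once with the default flags and once with ignore_discard=True.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_session_cookie('sessionid', 'abc123', 'example.com'))

# With the default flags, session cookies are skipped when saving
jar.save()
plain = http.cookiejar.MozillaCookieJar()
plain.load(path)

# With ignore_discard=True they are written and can be loaded back
jar.save(ignore_discard=True, ignore_expires=True)
kept = http.cookiejar.MozillaCookieJar()
kept.load(path, ignore_discard=True, ignore_expires=True)

print(len(plain), len(kept))
for item in kept:
    print(item.name + '=' + item.value)
```

This is why the save and load calls in this section pass ignore_discard=True: login state is often carried by session cookies, which would otherwise be dropped.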

The parsing module: urllib.parse

1. urllib.parse.urlencode
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

Building the data for a POST request: see above

Building a query string for a URL

import urllib.parse

params = urllib.parse.urlencode({"name": "sun", "age": 25})
base_url = "http://www.baidu.com?"
print(base_url + params)
# Output:
# http://www.baidu.com?name=sun&age=25
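
A few more cases, as a small sketch with made-up parameter values: urlencode escapes spaces and non-ASCII text, doseq=True expands a list into repeated keys, and parse_qs turns a query string back into a dictionary.

```python
import urllib.parse

# Spaces become '+', non-ASCII characters are percent-encoded as UTF-8
escaped = urllib.parse.urlencode({'q': 'hello world', 'lang': '中文'})
print(escaped)

# doseq=True expands a list value into repeated keys
query = urllib.parse.urlencode({'tag': ['a', 'b']}, doseq=True)
print(query)

# parse_qs reverses the encoding and returns a dict of lists
parsed = urllib.parse.parse_qs(query)
print(parsed)
```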

2.urllib.parse.urlparse
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True) splits a URL into its components.

import urllib.parse

res = urllib.parse.urlparse('http://www.baidu.com')
print(res)
# The result contains the scheme, host, path, parameters, query string, and fragment
# ParseResult(scheme='http', netloc='www.baidu.com', path='', params='', query='', fragment='')
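
With a URL that actually has every part, the six fields are easier to tell apart. The sketch below uses an illustrative URL, reads the fields by attribute and by index, and rebuilds the URL with urlunparse.

```python
import urllib.parse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
res = urllib.parse.urlparse(url)

# Fields can be read by attribute name...
print(res.scheme, res.netloc, res.path)
print(res.params, res.query, res.fragment)

# ...or by position, since ParseResult is a named tuple
print(res[1] == res.netloc)

# urlunparse puts the six parts back together
rebuilt = urllib.parse.urlunparse(res)
print(rebuilt == url)
```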

3.urllib.parse.urljoin
Joins a base URL with another URL, which may be relative.

import urllib.parse

res = urllib.parse.urljoin('http://www.baidu.com', 'FAQ.html')
print(res)
# http://www.baidu.com/FAQ.html
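
How the second argument is resolved depends on its form. The sketch below runs a few illustrative cases against one base URL; the paths are invented for the example.

```python
import urllib.parse

base = 'http://www.baidu.com/a/b.html'

# A bare file name replaces the last path segment of the base
sibling = urllib.parse.urljoin(base, 'c.html')

# A leading slash resolves from the site root
rooted = urllib.parse.urljoin(base, '/c.html')

# '..' moves up one directory
parent = urllib.parse.urljoin(base, '../c.html')

# An absolute URL replaces the base entirely
absolute = urllib.parse.urljoin(base, 'https://example.com/x')

for joined in (sibling, rooted, parent, absolute):
    print(joined)
```

This is handy when turning the href values found in a page into full URLs that can be requested.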

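Exception handling: urllib.error

1.urllib.error.URLError (the base exception) has one attribute
reason: the cause of the error

2.urllib.error.HTTPError (a subclass of URLError) has three attributes
code: the HTTP status code of the failed request
reason: a string explaining the error
headers: the HTTP response headers of the request that caused the error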

import urllib.error
import urllib.request

try:
    res = urllib.request.urlopen('http://www.baidu.com')
    print(res.read())
# HTTPError is a subclass of URLError, so it has to be caught first
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
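
The order of the except clauses matters because HTTPError inherits from URLError: a URLError clause placed first would also catch every HTTPError. The sketch below shows this offline with exceptions built by hand (urlopen normally raises them); the helper describe and the example URL are hypothetical.

```python
import urllib.error


def describe(exc):
    """Return a short description, checking the subclass before the base class."""
    try:
        raise exc
    except urllib.error.HTTPError as e:
        return 'HTTPError %s: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        return 'URLError: %s' % e.reason


# Exceptions built by hand for illustration
http_err = urllib.error.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
url_err = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(http_err))
print(describe(url_err))
```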

