Web Scraping with urllib
1. Getting to Know urllib
The urllib package contains the following modules:
- urllib.request — opens and reads URLs
- urllib.error — exceptions raised by urllib.request
- urllib.parse — parses URLs into components
- urllib.robotparser — parses robots.txt files
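The examples below exercise only urllib.request, so here is a minimal sketch of urllib.parse and urllib.error (it assumes httpbin.org is reachable):
from urllib import request, parse, error
# urllib.parse: split a URL into its components
parts = parse.urlparse('http://httpbin.org/get?name=abc')
print(parts.netloc, parts.path, parts.query)  # httpbin.org /get name=abc
# urllib.error: catch the HTTPError that urllib.request raises on 4xx/5xx
try:
    request.urlopen('http://httpbin.org/status/404')
except error.HTTPError as e:
    print(e.code)  # 404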
2. Scraping with urllib
2.1 A simple GET request
Quick and dirty, and easy to get blocked:
from urllib import request, parse
# fetch the page directly
url = "http://httpbin.org/"
string = request.urlopen(url).read().decode('utf8')
print(string)
Add headers to make the request look like an ordinary browser:
url = "http://httpbin.org/"
headers = {
'Host': 'httpbin.org',
'Referer': 'http://httpbin.org/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
req = request.Request(url, headers=headers, method='GET')
string = request.urlopen(req).read().decode('utf8')
print(string)
Both approaches return the page's HTML source.
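Besides read(), the object returned by urlopen is an http.client.HTTPResponse, so you can inspect the status and headers before decoding the body; a small sketch:
from urllib import request
resp = request.urlopen('http://httpbin.org/')
print(resp.status)                     # HTTP status code, e.g. 200
print(resp.getheader('Content-Type'))  # look up a single response header
html = resp.read().decode('utf8')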
2.2 A slightly more advanced POST
Attach data to send form fields:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# renamed from dict to avoid shadowing the built-in
form = {
    'name': 'abc',
    'password': '123'
}
# urlencode the form fields, then encode to the bytes urlopen requires
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
string = request.urlopen(req).read().decode('utf8')
print(string)
The server echoes back a JSON response:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "abc",
    "password": "123"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
  },
  "json": null,
  "origin": "202.119.46.99",
  "url": "http://httpbin.org/post"
}
With the json module you can read fields out of it:
import json
j = json.loads(string)
print(j['form']['name'])
# Output: abc
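The same parse.urlencode helper also builds GET query strings; a minimal sketch against httpbin's /get echo endpoint (not part of the original examples):
from urllib import request, parse
import json
params = parse.urlencode({'name': 'abc', 'password': '123'})
url = 'http://httpbin.org/get?' + params
string = request.urlopen(url).read().decode('utf8')
print(json.loads(string)['args']['name'])  # prints: abc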
2.3 Using cookies
Step 1: obtain cookies
import http.cookiejar
from urllib import request
cookie = http.cookiejar.CookieJar()
# route all requests through a cookie-aware opener
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
The returned cookies look like this:
BAIDUID=112C1EAFD************1B0E3DC9:FG=1
BIDUPSID=112C************F31C931B0E3DC9
H_PS_PSSID=
PSTM=15*****188
delPer=0
BDSVRTM=0
BD_HOME=0
Step 2: save cookies to a local file
(1) MozillaCookieJar
import http.cookiejar
from urllib import request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# also keep session cookies and already-expired cookies in the file
cookie.save(ignore_discard=True, ignore_expires=True)
(2) LWPCookieJar
import http.cookiejar
from urllib import request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
The two classes write noticeably different cookie file formats.
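Roughly, the two files begin with different magic headers, which is why load() must use the matching class (cookie values elided):
# cookie.txt, written by MozillaCookieJar (Netscape format), starts with:
# Netscape HTTP Cookie File
...
# cookie1.txt, written by LWPCookieJar, starts with:
#LWP-Cookies-2.0
...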
Step 3: use the saved cookies
Load the file with the same CookieJar class that saved it:
import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
This again returns the page's HTML source.
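Tying the pieces together, a hedged sketch that sends the saved cookies along with a custom User-Agent (via opener.addheaders) and a timeout; the URL and cookie file follow the examples above:
import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
# headers assigned here are attached to every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36')]
response = opener.open('http://www.baidu.com', timeout=10)
print(response.status)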