Web Scraping with urllib


1. Getting to know urllib

The urllib library contains the following modules (a quick sketch of the less familiar ones follows the list):

  • urllib.request: open and read URLs
  • urllib.error: the exceptions raised by urllib.request
  • urllib.parse: parse URLs into components
  • urllib.robotparser: parse robots.txt files
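
The last two modules are easy to overlook, so here is a minimal sketch of what they do (httpbin.org is just a stand-in target for illustration):

from urllib import parse, robotparser

# Split a URL into its components
parts = parse.urlparse('http://httpbin.org/get?name=abc')
print(parts.netloc)  # httpbin.org
print(parts.query)   # name=abc

# Ask robots.txt whether a path may be crawled
rp = robotparser.RobotFileParser()
rp.set_url('http://httpbin.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://httpbin.org/get'))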

2. Scraping with urllib

2.1 A simple GET request

Quick and dirty, and easy to get blocked:

from urllib import request, parse

# Fetch the page directly
url = "http://httpbin.org/"
string = request.urlopen(url).read().decode('utf8')
print(string)
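
The parse module in that import is what you would use to build a query string for a GET request. A minimal sketch (the query parameters are made up for illustration):

from urllib import request, parse

params = parse.urlencode({'name': 'abc', 'page': 1})  # made-up parameters
url = 'http://httpbin.org/get?' + params
print(request.urlopen(url).read().decode('utf8'))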

Adding headers to disguise the crawler as a normal browser:

url = "http://httpbin.org/"
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
req = request.Request(url, headers=headers, method='GET')
string = request.urlopen(req).read().decode('utf8')
print(string)

Both approaches return the page's source code.
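
The urllib.error module listed in section 1 is what catches failures here. A minimal sketch, using an httpbin status endpoint as an assumed test URL:

from urllib import request, error

try:
    string = request.urlopen('http://httpbin.org/status/404', timeout=10).read().decode('utf8')
except error.HTTPError as e:
    # The server answered, but with an error status code
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:
    # The server could not be reached at all
    print('Failed to reach server:', e.reason)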

2.2 A slightly more advanced POST

Attach data to send the form fields:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
form = {
    'name': 'abc',
    'password': '123'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
string = request.urlopen(req).read().decode('utf8')
print(string)

The response comes back as a JSON document:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "abc",
    "password": "123"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
  },
  "json": null,
  "origin": "202.119.46.99",
  "url": "http://httpbin.org/post"
}

With the json module you can read fields out of it:

import json

j = json.loads(string)
print(j['form']['name'])
# Output:
abc
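
Form data is not the only option: you can also POST a raw JSON body by setting the Content-Type header yourself. A sketch along the same lines (httpbin echoes the parsed body back under the json key):

import json
from urllib import request

url = 'http://httpbin.org/post'
payload = json.dumps({'name': 'abc'}).encode('utf8')
req = request.Request(url, data=payload,
                      headers={'Content-Type': 'application/json'},
                      method='POST')
resp = json.loads(request.urlopen(req).read().decode('utf8'))
print(resp['json'])  # {'name': 'abc'}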

2.3 Using cookies

Step 1: Get the cookies

import http.cookiejar
from urllib import request

cookie = http.cookiejar.CookieJar()            # in-memory jar that collects cookies
handler = request.HTTPCookieProcessor(cookie)  # handler that stores and resends them
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

The cookies that come back:

BAIDUID=112C1EAFD************1B0E3DC9:FG=1
BIDUPSID=112C************F31C931B0E3DC9
H_PS_PSSID=
PSTM=15*****188
delPer=0
BDSVRTM=0
BD_HOME=0
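
If you do not want to call opener.open everywhere, you can also install the opener globally so that plain request.urlopen carries the cookie jar. A minimal sketch:

import http.cookiejar
from urllib import request

cookie = http.cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
request.install_opener(opener)           # plain urlopen() now uses the jar
request.urlopen('http://www.baidu.com')  # cookies are stored and resent automatically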

Step 2: Save the cookies locally
(1) The MozillaCookieJar approach

import http.cookiejar
from urllib import request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

(2) The LWPCookieJar approach

import http.cookiejar
from urllib import request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

The two produce differently formatted cookie files: MozillaCookieJar writes the Netscape cookies.txt format, while LWPCookieJar writes libwww-perl's Set-Cookie3 format.

Step 3: Use the saved cookies
Load the file with the same CookieJar class you used to save it:

import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

This returns the page source, requested with the saved cookies attached.
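
opener.open also accepts a Request object, so the header disguise from section 2.1 can be combined with the loaded cookies. A sketch under those assumptions:

import http.cookiejar
from urllib import request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
opener = request.build_opener(request.HTTPCookieProcessor(cookie))

# The Request carries the browser-like headers; the opener carries the cookies
req = request.Request('http://www.baidu.com',
                      headers={'User-Agent': 'Mozilla/5.0'})
print(opener.open(req).status)  # 200 if the request succeeded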