Python Learning Journey (28)
Python Basics (27): Common Built-in Modules (III)
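1. urllib
urllib provides a set of functions for working with URLs.
A URL (Uniform Resource Locator) is a compact description of where a resource on the internet is located and how to access it; it is the standard address of a resource on the internet.
Every file on the internet has a unique URL, which tells the browser where the file is and how to handle it.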
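To make the "location plus access method" idea concrete, here is a minimal sketch that splits a URL into its parts with urllib.parse.urlsplit. It reuses the Douban API URL that appears later in this post; the variable name parts is only for illustration.

```python
from urllib.parse import urlsplit

# Split the Douban API URL used later in this post into its components.
parts = urlsplit('https://api.douban.com/v2/book/2129650')

print(parts.scheme)  # access method (protocol)
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the resource on that host
```

The scheme says how to fetch the resource, while the host and path say where it lives.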
(1) GET
The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.
# Fetch a Douban URL, https://api.douban.com/v2/book/2129650, and print the response
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

Result:
Status: 200 OK
Date: Sun, 09 Dec 2018 01:23:48 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2138
Connection: close
Vary: Accept-Encoding
X-Ratelimit-Remaining2: 99
X-Ratelimit-Limit2: 100
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=fdbz3slsf0s; expires=Mon, 09-Dec-19 01:23:48 GMT; domain=.douban.com; path=/
X-Douban-Newbid: fdbz3slsf0s
X-DAE-Node: brand55
X-DAE-App: book
Server: dae
X-Frame-Options: SAMEORIGIN
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],...}
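To make a GET request look as if it came from a browser, use a Request object. By adding HTTP headers to the Request object, we can disguise the request as a browser request.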
# Pretend to be an iPhone 6 requesting the Douban home page
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Result:
<title>豆瓣(手机版)</title>
<meta name="google-site-verification" content="ok0wcgt20tbbgo9_zat2iacimtn4ftf5ccsh092xeyw" />
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<link rel="canonical" href=" http://m.douban.com/">
<link href="https://img3.doubanio.com/f/talion/4b1de333c0e597678522bd3c3af276ba6c667b95/css/card/base.css" rel="stylesheet">
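The Request object can also be inspected without sending anything, which is a handy way to check what would go over the wire. The sketch below is an offline illustration only; the shortened User-Agent string is an assumption made for readability, not the full string used above.

```python
from urllib import request

# Build the request but do not send it.
req = request.Request('http://www.douban.com/')
# Shortened, illustrative User-Agent (assumption, not the full iPhone string).
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)')

print(req.get_method())    # no request body is attached, so this is a GET
print(req.header_items())  # headers that would be sent with the request
```

Note that Request stores header names in capitalized form, so the header is looked up as 'User-agent'.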
(2) POST
To send a request with POST, just pass the data parameter as bytes.
# Simulate a Weibo login: first read the login email and password
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3a%2f%2fm.weibo.cn%2f')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3a%2f%2fm.weibo.cn%2f')

# Passing data (as bytes) turns the request into a POST
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Result:
Login to weibo.cn...
Email: email
Password: password
Status: 200 OK
Server: nginx/1.6.1
Date: Sun, 09 Dec 2018 02:01:40 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
dpool_header: 85-144-160-aliyun-core.jpool.sinaimg.cn
Set-Cookie: login=9da7cd806ada2c22779667e8e1c039c2; path=/
Data: {"retcode":50011002,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"email","errline":669}}

The escaped msg field decodes to "用户名或密码错误" (wrong username or password), which is expected here because placeholder credentials were entered.
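The step of turning form fields into bytes can be tried on its own, without contacting any server. This is a minimal sketch; the field values (someone@example.com) are made-up placeholders, and the request is built but never sent.

```python
from urllib import parse, request

# Hypothetical form fields, for illustration only.
form = parse.urlencode([('username', 'someone@example.com'), ('entry', 'mweibo')])
body = form.encode('utf-8')  # urlopen() and Request expect the body as bytes

# Attaching a body makes urllib treat the request as a POST.
req = request.Request('https://passport.weibo.cn/sso/login', data=body)

print(form)
print(req.get_method())
```

urlencode takes care of percent-encoding special characters such as @, so the values can be passed in as ordinary strings.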
(3) Handler
For more complex control, such as accessing a site through a proxy, we need to use a ProxyHandler.
# "import urllib" alone does not load the request submodule, so import it explicitly
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
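Since www.example.com:3128 is only a placeholder proxy, the example above cannot really be run. The sketch below checks offline that the handler ends up in the opener; the handler_names variable is introduced here only for illustration.

```python
import urllib.request

# Placeholder proxy address taken from the example above; nothing is contacted.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# build_opener() combines our handler with the default handlers.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Optionally make plain urlopen() calls use this opener as well:
# urllib.request.install_opener(opener)
```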
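2. XML
There are two ways to process XML: DOM and SAX.
DOM reads the whole XML document into memory and parses it into a tree, so it uses a lot of memory and parses slowly; its advantage is that any node of the tree can be visited freely.
SAX works in streaming mode, parsing as it reads, so it uses little memory and parses quickly; its drawback is that we have to handle the events ourselves.
In most cases SAX should be the first choice, because DOM simply uses too much memory.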
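For contrast with the SAX approach, here is a minimal DOM sketch using xml.dom.minidom from the standard library. The small XML string is an assumption modeled on the list of links used in the SAX example in this post.

```python
from xml.dom.minidom import parseString

# Small sample document (assumed for illustration).
xml = '<ol><li><a href="/python">python</a></li><li><a href="/ruby">ruby</a></li></ol>'

# DOM: the whole document is parsed into an in-memory tree first...
doc = parseString(xml)

# ...and then any node can be reached directly.
links = [(a.getAttribute('href'), a.firstChild.data)
         for a in doc.getElementsByTagName('a')]
print(links)
```

The convenience of getElementsByTagName is exactly what costs memory: the full tree has to exist before the first query runs.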
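Parsing XML
Parsing XML with SAX in Python is very concise. The events we usually care about are start_element, end_element and char_data; once these three functions are ready, we can parse XML.
For example, when the SAX parser reads the node <a href="/">python</a>, three events fire:
start_element, on reading <a href="/">
char_data, on reading python
end_element, on reading </a>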
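from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">python</a></li>
    <li><a href="/ruby">ruby</a></li>
</ol>
'''

# Create the parser, register the three callbacks, then parse the document
handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)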
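Printing events is useful for exploring, but in practice the handler usually keeps some state. The following is one possible sketch that collects (href, text) pairs from a elements; the LinkCollector class and the collect_links function are hypothetical names introduced here, not part of the standard library.

```python
from xml.parsers.expat import ParserCreate

class LinkCollector:
    """Hypothetical handler that gathers (href, text) for every <a> element."""

    def __init__(self):
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def start_element(self, name, attrs):
        if name == 'a':
            self._href = attrs.get('href')
            self._text = []

    def char_data(self, text):
        # Text may arrive in several pieces, so buffer it until the tag closes.
        if self._href is not None:
            self._text.append(text)

    def end_element(self, name):
        if name == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None

def collect_links(xml_text):
    handler = LinkCollector()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return handler.links

sample = '<ol><li><a href="/python">python</a></li><li><a href="/ruby">ruby</a></li></ol>'
print(collect_links(sample))
```

This is the "handle the events ourselves" cost of SAX in miniature: the tree structure is never built, so the handler has to remember which element it is inside.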
Generating XML
For simple documents, the simplest and most effective way to generate XML is string concatenation.
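from xml.sax.saxutils import escape

def build_xml():
    # The original snippet called an undefined encode() helper; escape() from the
    # standard library is used here as a stand-in to escape &, < and >
    parts = []
    parts.append(r'<?xml version="1.0"?>')
    parts.append(r'<root>')
    parts.append(escape('some & data'))
    parts.append(r'</root>')
    return ''.join(parts)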
If the data to be generated is complex, consider using JSON instead of XML.
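As a rough illustration of why JSON is easier for nested data, the sketch below serializes a small structure with the json module. The books dictionary is made-up sample data.

```python
import json

# Made-up nested data, for illustration only.
books = {
    'title': 'some & data',
    'tags': ['python', 'xml'],
    'rating': {'max': 10, 'min': 0},
}

# json.dumps handles nesting and quoting; no manual escaping is needed.
text = json.dumps(books)
print(text)

# The text can be turned back into the same structure.
restored = json.loads(text)
```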
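3. HTMLParser
With HTMLParser, the text, images and other content in a web page can be parsed out.
HTML looks much like XML, but its syntax is far less strict (tags may be left unclosed, for example), so a standard XML DOM or SAX parser cannot be relied on to parse HTML.
Fortunately, Python provides HTMLParser, which makes parsing HTML very convenient.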
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>some <a href="#">html</a> html tutorial...<br>end</p>
</body></html>''')

Result (whitespace-only lines omitted):
<html>
<head>
</head>
<body>
<!--  test html parser  -->
<p>
some
<a>
html
</a>
 html tutorial...
<br>
end
</p>
</body>
</html>
The feed() method can be called more than once: the whole HTML string does not have to be passed in at one time, and it can be fed in piece by piece.
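The piece-by-piece behavior can be checked with a small sketch. TextCollector is a hypothetical subclass introduced here, and the chunk boundaries are deliberately placed in the middle of a tag and in the middle of a word.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical parser that only gathers text content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextCollector()
# The HTML arrives in three pieces; the <a> tag is split across two of them.
for piece in ['<p>some <a hr', 'ef="#">html</a> tuto', 'rial</p>']:
    parser.feed(piece)
parser.close()  # flush anything still buffered

text = ''.join(parser.chunks)
print(text)
```

A single run of text may be delivered through several handle_data calls, so joining the pieces afterwards is safer than assuming one call per text node.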
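There are two kinds of special character references: named ones such as &nbsp; and numeric ones such as &#1234;. Both can be handled by the parser. Note that since Python 3.5 the convert_charrefs option defaults to True, in which case references in text are converted to the corresponding characters and delivered through handle_data; handle_entityref and handle_charref are called only when the parser is created with convert_charrefs=False.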
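The difference between the two modes can be seen side by side in the sketch below. RefLogger and parse_events are hypothetical names introduced for this illustration; they record which callbacks fire for the same input.

```python
from html.parser import HTMLParser

class RefLogger(HTMLParser):
    """Hypothetical parser that records text and character-reference events."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))

def parse_events(html, convert_charrefs):
    parser = RefLogger(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events

html = '<p>a&amp;b&#1234;</p>'
raw = parse_events(html, convert_charrefs=False)       # references reported separately
converted = parse_events(html, convert_charrefs=True)  # references folded into the text
print(raw)
print(converted)
```

With convert_charrefs=False the named and numeric references reach handle_entityref and handle_charref; with the default setting they arrive already decoded inside the text.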