Python Web Scraping Basics: Using the urllib Library

程序员文章站 2022-05-03 20:04:39


Common techniques for web scrapers

1. Basic method

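# -*- coding: UTF-8 -*-
from urllib import request
# Plain http is used here: fetched this way, the https version of this page
# came back with part of its content missing. That is a matter of how a site
# treats non-browser clients, not a general property of https.
response = request.urlopen("http://www.baidu.com/")
content = response.read().decode('utf-8')
print(content)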
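
A bare urlopen call like this raises an exception if the request fails, and it assumes the page is UTF-8. As a minimal sketch (the `fetch` helper name and its 10-second default timeout are assumptions made for illustration, not part of the original example), the call could be wrapped as follows:

```python
from urllib import error, request

def fetch(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Prefer the charset declared in the Content-Type header,
            # falling back to UTF-8 when the server does not declare one.
            charset = response.headers.get_content_charset() or "utf-8"
    except error.URLError as exc:
        # URLError covers connection failures; HTTPError (4xx/5xx) is a subclass.
        print("request failed:", exc)
        return None
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

Calling fetch("http://www.baidu.com/") would then return the page text, or None when the server cannot be reached or answers with an error status.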

2. Masquerading as a browser

# -*- coding: UTF-8 -*-
from urllib import request
url = 'http://www.baidu.com'
# A desktop Chrome User-Agent. A mobile browser's User-Agent can be used
# instead to fetch the page the way a phone would see it.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"}
req = request.Request(url, headers=headers)  # attach the headers to the request
response = request.urlopen(req)
content = response.read().decode('utf-8')
print(content)
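
To switch between a desktop and a mobile identity, the header can be picked from a small table. The following is only a sketch: the `build_request` helper, the `USER_AGENTS` table and the mobile User-Agent string are assumptions chosen for illustration.

```python
from urllib import request

# Example User-Agent strings. The desktop one is taken from the example above;
# the mobile one is a placeholder value picked for this sketch.
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "mobile": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) "
              "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 "
              "Mobile/15E148 Safari/604.1",
}

def build_request(url, device="desktop"):
    """Return a Request whose User-Agent header mimics the given device type."""
    return request.Request(url, headers={"User-Agent": USER_AGENTS[device]})
```

A mobile fetch would then be request.urlopen(build_request(url, "mobile")). Note that Request stores header names capitalized, so the header is read back with get_header("User-agent").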

3. Using a proxy IP

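# -*- coding: UTF-8 -*-
from urllib import request
# Route http traffic through a proxy that needs no username or password
http_proxy = request.ProxyHandler({'http': '103.249.100.152:80'})
opener = request.build_opener(http_proxy)  # build an opener that uses the proxy
# Named req so that it does not shadow the imported request module
req = request.Request('http://www.baidu.com/')
response = opener.open(req)
print(response.read())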
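
The proxy in this example needs no account. For a proxy that does require one, ProxyHandler accepts credentials embedded in the proxy URL. A minimal sketch, assuming a hypothetical helper named `make_proxy_opener` and made-up credentials:

```python
from urllib import parse, request

def make_proxy_opener(host_port, user=None, password=None):
    """Build an opener that sends http and https requests through one proxy."""
    if user is not None and password is not None:
        # Percent-encode the credentials so characters such as '@' or ':'
        # cannot be mistaken for URL delimiters.
        auth = "{}:{}@".format(parse.quote(user, safe=""),
                               parse.quote(password, safe=""))
    else:
        auth = ""
    proxy_url = "http://{}{}".format(auth, host_port)
    handler = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return request.build_opener(handler)
```

The returned opener is used like the one above, for instance make_proxy_opener("103.249.100.152:80").open(url). Passing it to request.install_opener() makes plain request.urlopen() calls go through the proxy as well.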

4. Using cookies

1. Printing cookies

# -*- coding: UTF-8 -*-
# Python 2.7 is used here
import urllib2
import cookielib
cookie = cookielib.CookieJar()  # holds the received cookies in memory
httpcookieprocessor = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(httpcookieprocessor)
response = opener.open("http://www.baidu.com")
cookies = ''
for data in cookie:
    cookies = cookies + data.name + '=' + data.value + ';\n'
print cookies
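
The listing above targets Python 2.7. A rough Python 3 counterpart, where cookielib has become http.cookiejar and urllib2 has become urllib.request, might look like the sketch below. The helper names `make_cookie_opener` and `format_cookies` are assumptions made for illustration.

```python
from http import cookiejar
from urllib import request

def make_cookie_opener():
    """Return an opener plus the in-memory jar that collects its cookies."""
    jar = cookiejar.CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar

def format_cookies(jar):
    """Render the cookies in jar as 'name=value' pairs joined by '; '."""
    return "; ".join("{}={}".format(c.name, c.value) for c in jar)
```

After opener.open("http://www.baidu.com") has run, print(format_cookies(jar)) lists whatever cookies the server set.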

2. Saving cookies

# -*- coding: UTF-8 -*-
# Python 2.7 is used here
import urllib2
import cookielib
file_path = 'cookie.txt'
cookie = cookielib.LWPCookieJar(file_path)  # cookie jar backed by a file
httpcookieprocessor = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(httpcookieprocessor)
response = opener.open("http://www.baidu.com")
# Write the cookies to file_path, including session and expired cookies
cookie.save(ignore_expires=True, ignore_discard=True)
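
For readers on Python 3, the same idea can be sketched with http.cookiejar.LWPCookieJar. The function name `open_and_save_cookies` and the 10-second timeout are assumptions made for this illustration.

```python
from http import cookiejar
from urllib import request

def open_and_save_cookies(url, path="cookie.txt"):
    """Open url, then write the cookies it set to path in LWP format."""
    jar = cookiejar.LWPCookieJar(path)
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    with opener.open(url, timeout=10) as response:
        response.read()
    # Keep session cookies and already-expired cookies in the file as well.
    jar.save(ignore_discard=True, ignore_expires=True)
    return jar
```

The resulting file is plain text that begins with a "#LWP-Cookies-2.0" header line, so it can be inspected in an editor.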

3. Loading saved cookies

# -*- coding: UTF-8 -*-
# Python 2.7 is used here
import urllib2
import cookielib
file_path = "cookie.txt"
cookie = cookielib.LWPCookieJar()
# Read back the cookies saved earlier, including session and expired ones
cookie.load(file_path, ignore_discard=True, ignore_expires=True)
handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(handler)
response = opener.open('http://www.baidu.com')
print response.read()
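
In Python 3, reloading the saved cookies could look like the following sketch. The helper name `opener_from_saved_cookies` is an assumption, and so is the choice to start with an empty jar when the file does not exist yet.

```python
import os
from http import cookiejar
from urllib import request

def opener_from_saved_cookies(path="cookie.txt"):
    """Return an opener and jar preloaded with the cookies stored at path."""
    jar = cookiejar.LWPCookieJar(path)
    if os.path.exists(path):
        # Load session cookies and expired cookies too, mirroring how
        # they were saved.
        jar.load(ignore_discard=True, ignore_expires=True)
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar
```

Requests made through the returned opener send any loaded cookies that match the target site, and jar.save() can be called afterwards to persist newly received ones.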

5. Notes:

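1. In Python 3 the request functions live in urllib.request (import urllib.request, or from urllib import request as in the examples above); in Python 2.7 the corresponding module is urllib2 (import urllib2). Likewise, cookielib in Python 2.7 is http.cookiejar in Python 3.
2. urllib.request.urlopen (Python 3) is the counterpart of urllib2.urlopen (Python 2.7).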
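
If one script has to run under both versions, the renamed modules can be bridged with a guarded import. This is a sketch of one common pattern rather than something the original examples use:

```python
try:
    # Python 3 module layout
    from urllib.request import (urlopen, Request, build_opener,
                                ProxyHandler, HTTPCookieProcessor)
    from http.cookiejar import CookieJar, LWPCookieJar
except ImportError:
    # Python 2.7 module layout
    from urllib2 import (urlopen, Request, build_opener,
                         ProxyHandler, HTTPCookieProcessor)
    from cookielib import CookieJar, LWPCookieJar
```

After this block, names such as urlopen and CookieJar refer to the right implementation on either version, although print statements and byte/text handling still differ between the two.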
