Python Crawler Learning Path (Part 1): Scraping Web Pages with urllib
1. Quick start: the most basic usage, urllib.request.urlopen(url)
import urllib.request

url = "https://m.weibo.cn/"
file = urllib.request.urlopen(url)
data = file.read()          # read the entire response body as bytes
dataline = file.readline()  # the stream was already consumed by read(), so this returns b''
print(data)
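Besides the body, the response object returned by urlopen() also exposes the status code, the final URL, and the headers. A minimal sketch of inspecting them (decoding assumes the page is UTF-8 encoded):

import urllib.request

resp = urllib.request.urlopen("https://m.weibo.cn/")
print(resp.getcode())   # HTTP status code, e.g. 200
print(resp.geturl())    # the final URL after any redirects
print(resp.info())      # the response headers
text = resp.read().decode('utf-8')  # bytes -> str (assumes a UTF-8 page)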
2. Saving the page as a local file

file = urllib.request.urlopen(url)
data = file.read()
fhandle = open("C:/Python3/web/review/weibo.html", "wb")  # "wb" because data is bytes
fhandle.write(data)
fhandle.close()
# urlretrieve() downloads a URL straight to a local file in one call
filename = urllib.request.urlretrieve(url, filename="C:/Python3/web/review/weibo2.html")
# clear the cache that urlretrieve leaves behind
urllib.request.urlcleanup()
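Note that urlretrieve() actually returns a (local_filename, headers) tuple, so the variable above holds more than a file name. A small sketch of unpacking it:

local_path, headers = urllib.request.urlretrieve(url, filename="C:/Python3/web/review/weibo2.html")
print(local_path)  # the path the page was saved to
print(headers)     # the response headers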
2. Setting request headers
1. build_opener()
# a (name, value) tuple; opener.addheaders expects a list of such tuples
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/65.0.3325.181 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fhandle = open("C:/Python3/web/review/weibo3.html", "wb")
fhandle.write(data)
fhandle.close()
2. add_header()

req = urllib.request.Request(url)
# note the trailing spaces inside the string pieces; without them the
# concatenated User-Agent tokens would run together
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/65.0.3325.181 Safari/537.36")
data = urllib.request.urlopen(req).read()
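To confirm that the header was really sent, request a service that echoes the request back. A minimal sketch, assuming the public echo service httpbin.org is reachable (that URL is an assumption, not part of the original tutorial):

import json
import urllib.request

req = urllib.request.Request("http://httpbin.org/get")  # assumed echo endpoint
req.add_header("User-Agent", "Mozilla/5.0 (test)")
body = urllib.request.urlopen(req).read().decode('utf-8')
echoed = json.loads(body)
print(echoed["headers"]["User-Agent"])  # should print the UA we set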
3. timeout

url = "http://www.baidu.com"
# fetch the page repeatedly; any request that takes longer than 1 second raises
for i in range(1, 100):
    try:
        file = urllib.request.urlopen(url, timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred --> " + str(e))
4. HTTP requests in practice
1. GET
import urllib.request

filepath = "C:/Python3/web/second/get你好.html"
keywd = "你好"  # a non-ASCII keyword ("hello"); it must be percent-encoded for the URL
key_code = urllib.request.quote(keywd)
url = "http://www.baidu.com/s?wd=" + key_code
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
fhandle = open(filepath, "wb")
fhandle.write(data)
fhandle.close()
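quote() encodes one value at a time; when a query string carries several parameters, urllib.parse.urlencode() builds and encodes the whole thing at once. A minimal sketch (the second parameter here is only illustrative):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"wd": "你好", "ie": "utf-8"})
url = "http://www.baidu.com/s?" + params  # .../s?wd=%E4%BD%A0%E5%A5%BD&ie=utf-8
data = urllib.request.urlopen(url).read()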
2. POST

import urllib.request
import urllib.parse

filename = "C:/Python3/web/second/post.html"
url = "http://iqianyue.com/mypost/"
# URL-encode the form fields, then encode the result to UTF-8 bytes
postdata = urllib.parse.urlencode({
    "name": "mmm",
    "pass": "1111"
}).encode('utf-8')
req = urllib.request.Request(url, postdata)  # supplying data makes this a POST
data = urllib.request.urlopen(req).read()
fhandle = open(filename, "wb")
fhandle.write(data)
fhandle.close()
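Passing a data argument is exactly what switches the request from GET to POST, which the Request object itself can confirm:

import urllib.parse
import urllib.request

postdata = urllib.parse.urlencode({"name": "mmm", "pass": "1111"}).encode('utf-8')
req = urllib.request.Request("http://iqianyue.com/mypost/", postdata)
print(req.get_method())   # 'POST' because data was supplied
req2 = urllib.request.Request("http://iqianyue.com/mypost/")
print(req2.get_method())  # 'GET' when there is no data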
5. Setting up a proxy server

import urllib.request

def use_proxy(proxy_addr, url):
    # route HTTP traffic through the given proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # install the opener globally so every later urlopen() uses the proxy
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "114.99.30.195:18118"
url = "http://www.baidu.com"
data = use_proxy(proxy_addr, url)
print(len(data))
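ProxyHandler only routes the schemes listed in its dict, so an https:// URL would bypass the proxy above. A sketch covering both schemes, reusing the tutorial's placeholder address (free proxies like it expire quickly); calling opener.open() directly also avoids the global side effect of install_opener():

import urllib.request

proxy = urllib.request.ProxyHandler({
    'http': "114.99.30.195:18118",   # placeholder proxy from the tutorial
    'https': "114.99.30.195:18118",
})
opener = urllib.request.build_opener(proxy)
data = opener.open("https://m.weibo.cn/").read()
print(len(data))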
6. DebugLog

import urllib.request

url = "http://www.baidu.com"
# debuglevel=1 prints the HTTP(S) exchange to stdout as the request runs
httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
7. URLError

import urllib.request
import urllib.error

url = "http://www.baiduuuuuu.com"  # a deliberately bad domain to trigger the error
try:
    urllib.request.urlopen(url)
except urllib.error.URLError as e:
    # HTTPError (a subclass of URLError) carries a status code;
    # a plain URLError only has a reason
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
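Because HTTPError is a subclass of URLError, the hasattr checks can be replaced by catching HTTPError first; Python dispatches to the most specific matching handler:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.baiduuuuuu.com")
except urllib.error.HTTPError as e:
    # the server responded, but with an error status (404, 500, ...)
    print(e.code, e.reason)
except urllib.error.URLError as e:
    # no usable response at all (DNS failure, refused connection, ...)
    print(e.reason)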