Python Crawler Learning Path (Part 1): Scraping Web Pages with urllib
1. Quick start: the most basic usage, urllib.request.urlopen(url)
import urllib.request

url = "https://m.weibo.cn/"
file = urllib.request.urlopen(url)
data = file.read()          # read the entire response body as bytes
dataline = file.readline()  # the stream was already consumed by read(), so this returns b''
print(data)
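Besides the body, the response object returned by urlopen() also exposes the status code, the final URL, and the headers. A minimal sketch of inspecting them (decoding assumes the page is UTF-8 encoded):

import urllib.request

resp = urllib.request.urlopen("https://m.weibo.cn/")
print(resp.getcode())   # HTTP status code, e.g. 200
print(resp.geturl())    # the final URL after any redirects
print(resp.info())      # the response headers
text = resp.read().decode('utf-8')  # bytes -> str (assumes a UTF-8 page)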
2. Saving the page as a local file

file = urllib.request.urlopen(url)
data = file.read()
fhandle = open("C:/Python3/web/review/weibo.html", "wb")  # "wb" because data is bytes
fhandle.write(data)
fhandle.close()
# urlretrieve() downloads a URL straight to a local file in one call
filename = urllib.request.urlretrieve(url, filename="C:/Python3/web/review/weibo2.html")
# clear the cache that urlretrieve leaves behind
urllib.request.urlcleanup()
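Note that urlretrieve() actually returns a (local_filename, headers) tuple, so the variable above holds more than a file name. A small sketch of unpacking it:

local_path, headers = urllib.request.urlretrieve(url, filename="C:/Python3/web/review/weibo2.html")
print(local_path)  # the path the page was saved to
print(headers)     # the response headers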
2. Setting request headers
1. build_opener()
# a (name, value) tuple; opener.addheaders expects a list of such tuples
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/65.0.3325.181 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fhandle = open("C:/Python3/web/review/weibo3.html", "wb")
fhandle.write(data)
fhandle.close()
2. add_header()

req = urllib.request.Request(url)
# note the trailing spaces inside the string pieces; without them the
# concatenated User-Agent tokens would run together
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/65.0.3325.181 Safari/537.36")
data = urllib.request.urlopen(req).read()
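To confirm that the header was really sent, request a service that echoes the request back. A minimal sketch, assuming the public echo service httpbin.org is reachable (that URL is an assumption, not part of the original tutorial):

import json
import urllib.request

req = urllib.request.Request("http://httpbin.org/get")  # assumed echo endpoint
req.add_header("User-Agent", "Mozilla/5.0 (test)")
body = urllib.request.urlopen(req).read().decode('utf-8')
echoed = json.loads(body)
print(echoed["headers"]["User-Agent"])  # should print the UA we set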
3. timeout

url = "http://www.baidu.com"
# fetch the page repeatedly; any request that takes longer than 1 second raises
for i in range(1, 100):
    try:
        file = urllib.request.urlopen(url, timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred --> " + str(e))
4. HTTP requests in practice
1. GET
import urllib.request

filepath = "C:/Python3/web/second/get你好.html"
keywd = "你好"  # a non-ASCII keyword ("hello"); it must be percent-encoded for the URL
key_code = urllib.request.quote(keywd)
url = "http://www.baidu.com/s?wd=" + key_code
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
fhandle = open(filepath, "wb")
fhandle.write(data)
fhandle.close()
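quote() encodes one value at a time; when a query string carries several parameters, urllib.parse.urlencode() builds and encodes the whole thing at once. A minimal sketch (the second parameter here is only illustrative):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"wd": "你好", "ie": "utf-8"})
url = "http://www.baidu.com/s?" + params  # .../s?wd=%E4%BD%A0%E5%A5%BD&ie=utf-8
data = urllib.request.urlopen(url).read()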
2. POST

import urllib.request
import urllib.parse

filename = "C:/Python3/web/second/post.html"
url = "http://iqianyue.com/mypost/"
# URL-encode the form fields, then encode the result to UTF-8 bytes
postdata = urllib.parse.urlencode({
    "name": "mmm",
    "pass": "1111"
}).encode('utf-8')
req = urllib.request.Request(url, postdata)  # supplying data makes this a POST
data = urllib.request.urlopen(req).read()
fhandle = open(filename, "wb")
fhandle.write(data)
fhandle.close()
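Passing a data argument is exactly what switches the request from GET to POST, which the Request object itself can confirm:

import urllib.parse
import urllib.request

postdata = urllib.parse.urlencode({"name": "mmm", "pass": "1111"}).encode('utf-8')
req = urllib.request.Request("http://iqianyue.com/mypost/", postdata)
print(req.get_method())   # 'POST' because data was supplied
req2 = urllib.request.Request("http://iqianyue.com/mypost/")
print(req2.get_method())  # 'GET' when there is no data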
5. Setting up a proxy server

import urllib.request

def use_proxy(proxy_addr, url):
    # route HTTP traffic through the given proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # install the opener globally so every later urlopen() uses the proxy
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "114.99.30.195:18118"
url = "http://www.baidu.com"
data = use_proxy(proxy_addr, url)
print(len(data))
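ProxyHandler only routes the schemes listed in its dict, so an https:// URL would bypass the proxy above. A sketch covering both schemes, reusing the tutorial's placeholder address (free proxies like it expire quickly); calling opener.open() directly also avoids the global side effect of install_opener():

import urllib.request

proxy = urllib.request.ProxyHandler({
    'http': "114.99.30.195:18118",   # placeholder proxy from the tutorial
    'https': "114.99.30.195:18118",
})
opener = urllib.request.build_opener(proxy)
data = opener.open("https://m.weibo.cn/").read()
print(len(data))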
6. DebugLog

import urllib.request

url = "http://www.baidu.com"
# debuglevel=1 prints the HTTP(S) exchange to stdout as the request runs
httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
7. URLError

import urllib.request
import urllib.error

url = "http://www.baiduuuuuu.com"  # a deliberately bad domain to trigger the error
try:
    urllib.request.urlopen(url)
except urllib.error.URLError as e:
    # HTTPError (a subclass of URLError) carries a status code;
    # a plain URLError only has a reason
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
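Because HTTPError is a subclass of URLError, the hasattr checks can be replaced by catching HTTPError first; Python dispatches to the most specific matching handler:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.baiduuuuuu.com")
except urllib.error.HTTPError as e:
    # the server responded, but with an error status (404, 500, ...)
    print(e.code, e.reason)
except urllib.error.URLError as e:
    # no usable response at all (DNS failure, refused connection, ...)
    print(e.reason)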