[Python3.x] Web Crawler (1): Fetching Web Page Content from a Specified URL with urllib

程序员文章站 (Programmer Article Site), 2022-05-04 11:29:48

...

1. Fetching a home page
Method 1:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import urllib.request

# Open the URL and read the whole response body (returned as bytes)
response = urllib.request.urlopen('http://www.lovejing.com/')
html = response.read()
print(html)

Method 2:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import urllib.request

# Build a Request object first, then pass it to urlopen()
req = urllib.request.Request('http://www.lovejing.com/')
response = urllib.request.urlopen(req)
html = response.read()
print(html)
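
Both methods print a bytes object, because read() returns the raw response body. A minimal sketch of turning those bytes into text follows. The helper name decode_body and the UTF-8 fallback are assumptions for illustration, not part of the original article, and the sample bytes are hand-made so the snippet runs without a network connection.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body; fall back to UTF-8 when no charset is known."""
    # errors='replace' keeps the call from raising on stray bytes
    return raw.decode(charset or 'utf-8', errors='replace')


# Offline demonstration with hand-made bytes instead of a live response
sample = '<title>你好</title>'.encode('utf-8')
text = decode_body(sample)
print(type(sample).__name__, '->', type(text).__name__)  # bytes -> str
```

With a live response, the charset could probably be taken from response.headers.get_content_charset(), which returns None when the server does not declare one.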

2. Sending form data (POST request)

import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'

values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.parse.urlencode(values).encode('utf-8')  # URL-encode the form, then convert it to bytes
req = urllib.request.Request(url, data)  # supplying data makes this a POST request
response = urllib.request.urlopen(req)  # send the request and receive the response
the_page = response.read()  # read the response body
print(the_page)
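
To see what the encoding step produces, the request can be inspected before it is sent. The sketch below reuses the values from the listing above and only builds the Request object; it never calls urlopen(), so it runs offline.

```python
import urllib.parse
import urllib.request

values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}

# urlencode() produces a str; the Request body must be bytes
encoded = urllib.parse.urlencode(values)
data = encoded.encode('utf-8')

req = urllib.request.Request('http://www.lovejing.com/cgi-bin/register.cgi', data)

print(encoded)           # name=WHY&location=SDU&language=Python
print(req.get_method())  # POST
```

The method switches to POST only because a data argument was supplied; without it the same Request would be sent as GET.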

3. GET request

import urllib.parse
import urllib.request

data = {}

data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.parse.urlencode(data)
print(url_values)  # name=WHY&location=SDU&language=Python

url = 'http://www.lovejing.com/example.cgi'
full_url = url + '?' + url_values

response = urllib.request.urlopen(full_url)  # no data argument, so this is a GET request
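
The URL-building step can be checked on its own by parsing the result back. This is a small sketch under the assumption that a helper is wanted; build_get_url is a hypothetical name introduced here, and nothing is fetched.

```python
import urllib.parse


def build_get_url(base_url, params):
    """Append URL-encoded params to base_url as a query string."""
    return base_url + '?' + urllib.parse.urlencode(params)


full_url = build_get_url('http://www.lovejing.com/example.cgi',
                         {'name': 'WHY', 'location': 'SDU', 'language': 'Python'})

# Parse the query string back into a dict of lists to confirm the round trip
query = urllib.parse.urlsplit(full_url).query
parsed = urllib.parse.parse_qs(query)

# Characters that are special in a query string are escaped by urlencode()
escaped = urllib.parse.urlencode({'q': 'a b&c'})
print(escaped)  # q=a+b%26c
```

Letting urlencode() do the escaping is what keeps a value containing & or a space from being misread as a separate parameter.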

4. Setting headers on the HTTP request

import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.parse.urlencode(values).encode('utf-8')  # URL-encode the form, then convert it to bytes
req = urllib.request.Request(url, data, headers)  # POST request carrying both the form data and the custom headers
response = urllib.request.urlopen(req)  # send the request and receive the response
the_page = response.read()  # read the response body
print(the_page)
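
To confirm a header was attached, the Request object can be queried before sending. The sketch below runs offline and omits the form data to keep the focus on headers. One detail worth knowing: Request stores header names via str.capitalize(), so the lookup key is 'User-agent' rather than 'User-Agent'.

```python
import urllib.request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = urllib.request.Request('http://www.lovejing.com/cgi-bin/register.cgi',
                             headers={'User-Agent': user_agent})

# Header names are normalized with str.capitalize() when stored
sent_agent = req.get_header('User-agent')
exact_case = req.get_header('User-Agent')  # lookup is case-sensitive, so this is None

# Headers can also be added after construction
req.add_header('Referer', 'http://www.lovejing.com/')

print(sent_agent == user_agent)  # True
print(exact_case)                # None
```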
