[Python3.x] Web Crawler (1): Fetching Web Page Content from a Specified URL with urllib

程序员文章站 2022-05-04 11:29:48

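1. Fetching a home page (the example URL is http://www.lovejing.com/)
Method 1: pass the URL string directly to urlopen()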

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import urllib.request

response = urllib.request.urlopen('http://www.lovejing.com/')
html = response.read()  # read() returns the raw body as bytes
print(html)

Method 2: build a Request object first, then pass it to urlopen()

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import urllib.request

req = urllib.request.Request('http://www.lovejing.com/')
response = urllib.request.urlopen(req)
html = response.read()  # read() returns the raw body as bytes
print(html)
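
Both methods print a bytes object (displayed as b'...'), because read() returns the undecoded response body. The following is a minimal sketch of turning that body into text. It assumes the server declares its charset in the Content-Type header and falls back to UTF-8 when it does not; decode_body and fetch_text are hypothetical helper names, not part of urllib.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes; fall back to UTF-8 if no charset is given (an assumption)."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body as text."""
    # The with-statement closes the connection when the block exits.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() reads the charset parameter of Content-Type, or returns None.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling fetch_text('http://www.lovejing.com/') would then return a str instead of bytes.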

2. Sending form data (POST request)

import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'    

values = {'name' : 'WHY',    
          'location' : 'SDU',    
          'language' : 'Python' }    

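data = urllib.parse.urlencode(values).encode('utf-8')  # URL-encode the form, then convert it to bytes
req = urllib.request.Request(url, data)  # supplying data makes this a POST request
response = urllib.request.urlopen(req)  # send the request and receive the response
the_page = response.read()  # read the response body (bytes)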
print(the_page)
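
To see what the encoding step produces without contacting a server, the request can be built and inspected offline. This is only a sketch that reuses the form values from the listing above; nothing is sent over the network.

```python
import urllib.parse
import urllib.request

values = {"name": "WHY", "location": "SDU", "language": "Python"}

# urlencode() builds the key=value&... string; encode() turns it into the bytes urlopen() expects.
data = urllib.parse.urlencode(values).encode("utf-8")
print(data)  # b'name=WHY&location=SDU&language=Python'

# Building the Request does not contact the server; it only describes the request.
req = urllib.request.Request("http://www.lovejing.com/cgi-bin/register.cgi", data)
print(req.get_method())  # POST, because data is not None
```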

3. GET request

import urllib.parse
import urllib.request

data = {}  

data['name'] = 'WHY'    
data['location'] = 'SDU'    
data['language'] = 'Python'  

url_values = urllib.parse.urlencode(data)
print(url_values)

url =  'http://www.lovejing.com/example.cgi'    
full_url = url + '?' + url_values

response = urllib.request.urlopen(full_url)  # no data argument, so this is a GET request
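
urlencode() also escapes characters that would otherwise break the query string. The sketch below illustrates this offline; the 'q' parameter is a hypothetical addition used only to show the escaping, and no request is actually sent.

```python
import urllib.parse
import urllib.request

# 'q' is a hypothetical extra parameter, added only to show escaping.
params = {"name": "WHY", "location": "SDU", "q": "web crawler & urllib"}

query = urllib.parse.urlencode(params)
print(query)  # name=WHY&location=SDU&q=web+crawler+%26+urllib

full_url = "http://www.lovejing.com/example.cgi" + "?" + query

# Without a data argument the request method stays GET.
req = urllib.request.Request(full_url)
print(req.get_method())  # GET

# parse_qs() reverses the encoding; each value comes back as a list.
decoded = urllib.parse.parse_qs(urllib.parse.urlsplit(full_url).query)
print(decoded["q"])  # ['web crawler & urllib']
```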

4. Setting headers on the HTTP request

import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'    

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' 
values = {'name' : 'WHY',    
          'location' : 'SDU',    
          'language' : 'Python' }    
headers = { 'User-Agent' : user_agent }  
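data = urllib.parse.urlencode(values).encode('utf-8')  # URL-encode the form, then convert it to bytes
req = urllib.request.Request(url, data, headers)  # POST request carrying the form data and custom headers
response = urllib.request.urlopen(req)  # send the request and receive the response
the_page = response.read()  # read the response body (bytes)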
print(the_page)
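
The headers attached to a Request can be checked before it is sent. One detail worth knowing is that Request normalizes header names with str.capitalize(), so 'User-Agent' is stored as 'User-agent'. The sketch below shows this offline; the Referer value is a hypothetical example and the form is shortened to one field.

```python
import urllib.parse
import urllib.request

url = "http://www.lovejing.com/cgi-bin/register.cgi"
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
data = urllib.parse.urlencode({"name": "WHY"}).encode("utf-8")

# Keyword arguments make it clear which value is the body and which are the headers.
req = urllib.request.Request(url, data=data, headers=headers)

# Header names are stored capitalized, so look them up as 'User-agent'.
print(req.get_header("User-agent"))  # Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)
print(req.has_header("User-Agent"))  # False

# Headers can also be added after construction (hypothetical Referer value).
req.add_header("Referer", "http://www.lovejing.com/")
print(req.header_items())
```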