[Python3.x] Web Crawler (1): Using urllib to Fetch Web Page Content from a Specified URL
1. Fetching a homepage
Method 1:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request

# urlopen() fetches the URL and returns an HTTPResponse object
response = urllib.request.urlopen('http://www.lovejing.com/')
html = response.read()   # the raw response body (bytes)
print(html)
Method 2:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request

# Build a Request object first, then pass it to urlopen()
req = urllib.request.Request('http://www.lovejing.com/')
response = urllib.request.urlopen(req)
html = response.read()   # the raw response body (bytes)
print(html)
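Both methods are equivalent: urlopen() accepts either a URL string or a Request object, and read() returns raw bytes. To get text, decode the bytes with the charset the server reports; a minimal sketch (the UTF-8 fallback is an assumption):

import urllib.request

response = urllib.request.urlopen('http://www.lovejing.com/')
# The body is bytes; decode it with the charset declared by the server,
# falling back to UTF-8 when none is given (assumption for this sketch).
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)
print(html[:200])   # first 200 characters of the decoded page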
2. Sending form data (POST request)
import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.parse.urlencode(values).encode('utf-8')   # URL-encode the form and convert to bytes
req = urllib.request.Request(url, data)                 # attaching data makes this a POST request
response = urllib.request.urlopen(req)                  # send the request
the_page = response.read()                              # read the response body
print(the_page)
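It is the presence of data that turns the request into a POST; the same Request without data is sent as a GET. A quick way to confirm this, as a sketch using the same hypothetical URL:

import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'
data = urllib.parse.urlencode({'name': 'WHY'}).encode('utf-8')

post_req = urllib.request.Request(url, data)   # data attached -> POST
get_req = urllib.request.Request(url)          # no data       -> GET
print(post_req.get_method())   # 'POST'
print(get_req.get_method())    # 'GET'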
3. Sending a GET request
import urllib.parse
import urllib.request

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.parse.urlencode(data)
print(url_values)                  # name=WHY&location=SDU&language=Python

url = 'http://www.lovejing.com/example.cgi'
full_url = url + '?' + url_values  # append the query string to the URL
response = urllib.request.urlopen(full_url)
print(response.read())
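urlencode() also percent-encodes characters that are not safe to put in a URL, which is the main reason to use it instead of concatenating values by hand. A small sketch (the values here are made up for illustration):

from urllib.parse import urlencode

# Hypothetical parameters containing a space and non-ASCII text
params = {'q': 'web crawler 入门', 'page': 2}
print(urlencode(params))
# q=web+crawler+%E5%85%A5%E9%97%A8&page=2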
4. Adding headers to the HTTP request
import urllib.parse
import urllib.request

url = 'http://www.lovejing.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values).encode('utf-8')   # URL-encode the form and convert to bytes
req = urllib.request.Request(url, data, headers)        # POST request with custom headers
response = urllib.request.urlopen(req)                  # send the request
the_page = response.read()                              # read the response body
print(the_page)
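Headers can also be attached after the Request object is created, via add_header(); a minimal sketch (the extra Accept-Language header is only an illustration):

import urllib.request

req = urllib.request.Request('http://www.lovejing.com/cgi-bin/register.cgi')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')   # illustrative extra header
print(req.header_items())   # inspect the headers that will be sent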