Python Web Crawler, Part 2: Hands-On Projects with the Requests Library

程序员文章站 (Programmer Article Station) 2022-07-14 11:03:17


Hands-On Web Crawling with the Requests Library

1 Crawling a JD.com product page

Target page URL: https://item.jd.com/5089267.html

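import requests

url = 'https://item.jd.com/5089267.html'
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding  # decode with the encoding guessed from the content
    print(r.text[:1000])  # show only the first 1000 characters
except requests.RequestException:
    print("Crawl failed")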
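Since the same fetch-and-decode pattern recurs in the sections that follow, it can be wrapped in a small helper. The following is a minimal sketch rather than the author's code: the name fetch_text and the 10-second timeout are assumptions made for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    fetch_text and the 10-second timeout are illustrative choices,
    not part of the original example.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx responses into exceptions
        r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
        return r.text
    except requests.RequestException as e:
        # Covers connection errors, timeouts, bad status codes and invalid URLs.
        print("Crawl failed:", e)
        return None
```

Calling fetch_text('https://item.jd.com/5089267.html') would then return the page text, of which the first 1000 characters can be printed as above. Catching requests.RequestException instead of using a bare except keeps unrelated errors such as KeyboardInterrupt from being swallowed.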

2 Crawling a Dangdang product page

Target page URL: http://product.dangdang.com/26487763.html

import requests

url = 'http://product.dangdang.com/26487763.html'
try:
    r = requests.get(url)  # sent with the default Requests headers
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except IOError as e:  # requests.RequestException is a subclass of IOError
    print(str(e))

The following error occurs:
HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: /26487763.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10fc390>: Failed to establish a new connection: [Errno 111] Connection refused',))

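Cause of the error: Dangdang rejects requests that do not look like they come from a real browser.
Check the initial HTTP request headers (by default, Requests identifies itself with a User-Agent of the form python-requests/<version>):
print(r.request.headers)
Improved code: construct a reasonable HTTP request header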

import requests

url = 'http://product.dangdang.com/26487763.html'
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # present the request as coming from a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except IOError as e:
    print(str(e))

Result: the page is now crawled successfully.
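To see what the headers= argument changes without contacting the site, a request can be prepared but not sent. This is a minimal sketch: it uses a Session only to merge in the default headers, and the variable names are illustrative.

```python
import requests

url = "http://product.dangdang.com/26487763.html"

# The User-Agent that Requests sends when none is supplied.
print(requests.utils.default_user_agent())

session = requests.Session()

# Prepare (but do not send) the request with the default session headers.
default_req = session.prepare_request(requests.Request("GET", url))

# Prepare it again with the browser-like User-Agent used in the improved code.
custom_req = session.prepare_request(
    requests.Request("GET", url, headers={"user-agent": "Mozilla/5.0"})
)

# Header lookup is case-insensitive, so 'user-agent' overrides 'User-Agent'.
print(default_req.headers["User-Agent"])
print(custom_req.headers["User-Agent"])
```

The first prepared request carries the python-requests identifier, while the second carries Mozilla/5.0, which is the only difference between the failing and the working version of the crawler.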

3 Submitting keywords to the Baidu and 360 search engines

Baidu keyword interface: http://www.baidu.com/s?wd=keyword
Implementation:

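import requests

keyword = "python"
try:
    kv = {'wd': keyword}  # Baidu expects the keyword in the 'wd' parameter
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))  # length of the returned page
except IOError as e:
    print(str(e))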

Result: the script prints the URL that was actually requested, followed by the length of the response text.
360 keyword interface: http://www.so.com/s?q=keyword
Implementation:

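import requests

keyword = "Linux"
try:
    kv = {'q': keyword}  # 360 search expects the keyword in the 'q' parameter
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))  # length of the returned page
except IOError as e:
    print(str(e))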
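Both examples rely on the params argument to build the query string. The sketch below shows the resulting URLs without sending any request, by preparing the request only; the helper name build_search_url is an assumption made for illustration.

```python
import requests


def build_search_url(base_url, param_name, keyword):
    """Return the URL Requests would send, without sending anything."""
    prepared = requests.Request(
        "GET", base_url, params={param_name: keyword}
    ).prepare()
    return prepared.url


print(build_search_url("http://www.baidu.com/s", "wd", "python"))
print(build_search_url("http://www.so.com/s", "q", "Linux"))
# Non-ASCII keywords are percent-encoded as UTF-8 automatically.
print(build_search_url("http://www.baidu.com/s", "wd", "爬虫"))
```

Passing a dictionary through params is preferable to concatenating the keyword onto the URL by hand, because Requests takes care of the escaping.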

4 Crawling and saving an image from the web

Format of a web image link:
http://FQDN/picture.jpg
Xiaohuar site: http://www.xiaohuar.com
Pick an image URL: http://www.xiaohuar.com/d/file/20141116030511162.jpg
Implementation:

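import os
import requests

url = "http://www.xiaohuar.com/d/file/20141116030511162.jpg"
root = "D:/pics/"  # target directory (renamed from dir, which shadows a builtin)
path = root + url.split('/')[-1]  # save the image under its original file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # avoid saving an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except IOError as e:
    print(str(e))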

Checking the target directory confirms that the image has been saved.
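Taking the last '/'-separated piece of the URL works for the link above, but it would keep any query string in the file name. The following is a minimal sketch of a slightly more defensive variant; the function names local_path_for and save_image, the 10-second timeout and the 8192-byte chunk size are assumptions, not part of the original code.

```python
import os
import posixpath
from urllib.parse import urlsplit

import requests


def local_path_for(url, root):
    """Join root with the file name taken from the path part of the URL."""
    name = posixpath.basename(urlsplit(url).path)  # ignores ?query and #fragment
    if not name:
        raise ValueError("URL does not end in a file name: " + url)
    return os.path.join(root, name)


def save_image(url, root):
    """Download url into root unless the file already exists; return its path."""
    path = local_path_for(url, root)
    os.makedirs(root, exist_ok=True)  # also creates missing parent directories
    if os.path.exists(path):
        return path
    r = requests.get(url, stream=True, timeout=10)
    r.raise_for_status()
    with open(path, "wb") as f:
        # Write the body in chunks instead of holding it all in memory.
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return path


print(local_path_for("http://www.xiaohuar.com/d/file/20141116030511162.jpg", "pics"))
```

A call such as save_image(url, "D:/pics/") would then replace the body of the try block in the script above.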

5 Looking up the location of an IP address

IP address location lookup interface: http://www.ip138.com/ips138.asp?ip=
Implementation:

import requests
url = "http://www.ip38.com/ip.php?ip="
try:
    r = requests.get(url+'104.193.88.77')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except IOError as e:
    print(str(e))
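Because the address is appended directly to the query URL, it can help to validate it first. The sketch below uses the standard ipaddress module and the same endpoint as the code above; the helper name build_lookup_url is an assumption made for illustration.

```python
import ipaddress

# Same endpoint as in the code above.
LOOKUP_URL = "http://www.ip38.com/ip.php?ip="


def build_lookup_url(ip):
    """Return the lookup URL for ip, raising ValueError if ip is malformed."""
    # ip_address() raises ValueError for anything that is not a valid
    # IPv4 or IPv6 address.
    address = ipaddress.ip_address(ip)
    return LOOKUP_URL + str(address)


print(build_lookup_url("104.193.88.77"))
```

The returned URL can be passed to requests.get() exactly as in the script above, so a malformed address is rejected locally instead of being sent to the lookup site.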

6 Submitting a form to Youdao Translate

Open Youdao Translate, then in the browser's developer tools click "Network" followed by "XHR" to find the translation request data:

import requests
import json

def get_translate_date(word=None):
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    # POST parameters go in the request body, so build a new dictionary
    form_data = {'i': word,
                 'from': 'AUTO',
                 'to': 'AUTO',
                 'smartresult': 'dict',
                 'client': 'fanyideskweb',
                 'salt': '15569272902260',
                 'sign': 'b2781ea3e179798436b2afb674ebd223',
                 'ts': '1556927290226',
                 'bv': '94d71a52069585850d26a662e1bcef22',
                 'doctype': 'json',
                 'version': '2.1',
                 'keyfrom': 'fanyi.web',
                 'action': 'FY_BY_REALTlME'
                 }
    # Submit the form data
    response = requests.post(url, data=form_data)
    # Convert the JSON string into a dictionary
    content = json.loads(response.text)
    # Print the translated text
    print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    word = input("Enter the text to translate: ")
    get_translate_date(word)

Result: the script prompts for the text to translate and prints the translation returned by the server.
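The lookup content['translateResult'][0][0]['tgt'] raises an exception if the server returns an error payload instead of a translation. The following is a minimal sketch of a more defensive extraction. The sample response body is illustrative only: the translateResult[0][0]['tgt'] path comes from the code above, while the other fields are assumed.

```python
import json

# Illustrative response body; only the translateResult[0][0]['tgt'] path
# is taken from the code above, the remaining fields are assumed.
sample = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "你好"}]]}'


def extract_translation(text):
    """Return the translated string, or None if the expected fields are missing."""
    try:
        content = json.loads(text)
        return content["translateResult"][0][0]["tgt"]
    except (ValueError, KeyError, IndexError, TypeError):
        # ValueError covers invalid JSON; the others cover an unexpected shape.
        return None


print(extract_translation(sample))
print(extract_translation('{"errorCode": 50}'))
```

In the script above, extract_translation(response.text) could replace the json.loads call and the direct indexing, with a None result reported to the user as a failed translation.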
