Python crawler: notes on scraping AJAX data and accessing HTTPS sites

程序员文章站 (Programmer Article Station), 2022-05-05 15:57:03

...

https://movie.douban.com/j/search_subjects?type=movie&tag=热门&page_limit=10&page_start=0

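Analyzing the link to be scraped gives the following query parameters, which need to be URL-encoded:
type=movie: the movie category
tag=热门: the "热门" (popular) section under movies
page_limit=10: the number of entries to return; change it to fetch a different amount
page_start=0: the starting position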
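To make the encoding step concrete, here is a minimal sketch of how these parameters could be turned into a full URL with `urlencode`. The helper name `build_url` and the constant `BASE_URL` are made up for illustration, and treating `page_start` as an item offset that advances by `page_limit` is an assumption, not something the site documents.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/search_subjects'


def build_url(page_start=0, page_limit=10, tag='热门'):
    """Return the search URL for one page of results (hypothetical helper)."""
    params = {
        'type': 'movie',
        'tag': tag,  # non-ASCII value; urlencode percent-encodes it as UTF-8
        'page_limit': page_limit,
        'page_start': page_start,
    }
    return '{}?{}'.format(BASE_URL, urlencode(params))


# URLs for three consecutive pages, assuming page_start is an item offset
for start in range(0, 30, 10):
    print(build_url(page_start=start))
```

Building the query from a dict keeps the Chinese tag readable in the source while the encoded form only appears in the final URL.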


from urllib.parse import urlencode
from urllib.request import urlopen, Request
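import simplejson  # third-party package (pip install simplejson); the standard-library json module also provides loads()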

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
# Disguise the crawler as a browser; a pool of UAs can also be rotated so the server has a harder time identifying the crawler
jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': '10',
    'page_start': '0'
}

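# Build the full URL; urlencode percent-encodes the non-ASCII tag value
req = Request('{}?{}'.format(jurl, urlencode(d)), headers={
    'User-agent': ua
})

with urlopen(req) as res:
    # The endpoint returns JSON; the movie entries are under the 'subjects' key
    subjects = simplejson.loads(res.read())
    print(len(subjects['subjects']))
    print(subjects)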
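The UA pool mentioned in the comment above could look roughly like the sketch below. It reuses the two User-Agent strings that appear in these notes; the names `UA_POOL` and `make_request` are hypothetical, and a real pool would likely be larger.

```python
import random
from urllib.request import Request

# Hypothetical pool; any genuine browser User-Agent strings would do
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]


def make_request(url, pool=UA_POOL):
    """Build a Request whose User-Agent is picked at random from the pool."""
    return Request(url, headers={'User-Agent': random.choice(pool)})


req = make_request('https://movie.douban.com/j/search_subjects')
# urllib stores header names capitalized as 'User-agent'
print(req.get_header('User-agent'))
```

Each call picks a fresh User-Agent, so successive requests do not all carry the same browser signature.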

Ignoring HTTPS certificates

from urllib.request import Request, urlopen

request = Request('https://www.12306.cn/mormhweb/')
request.add_header(
    "User-Agent",
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
)

with urlopen(request) as res:
    # _method is a private attribute of the response that holds the HTTP method (GET here)
    print(res._method)
    print(res.read())

P.S. In the course, this request raised ssl.CertificateError because 12306 did not have a CA-signed certificate at the time. That has since been fixed.
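For sites that still fail this way, a certificate failure can be told apart from other network errors before deciding what to do. The sketch below assumes that `urlopen` wraps the SSL failure in a `URLError` and keeps the original exception in `.reason`, which is its usual behavior; the helper name `is_certificate_error` is made up for illustration.

```python
import ssl
from urllib.error import URLError


def is_certificate_error(exc):
    """Return True if exc is, or wraps, a certificate verification failure."""
    # urlopen usually re-raises SSL failures as URLError with the cause in .reason
    cause = exc.reason if isinstance(exc, URLError) else exc
    return isinstance(cause, ssl.CertificateError)
```

A caller could wrap `urlopen(request)` in `try/except URLError as exc` and only consider disabling verification when `is_certificate_error(exc)` is true.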

For HTTPS sites whose certificate is not trusted, import the ssl module and tell urlopen to ignore the certificate warning.

from urllib.request import Request, urlopen
import ssl

request = Request('https://www.12306.cn/mormhweb/')
request.add_header(
    "User-Agent",
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
)

# Ignore untrusted certificates
context = ssl._create_unverified_context()

with urlopen(request, context=context) as res:
    # The context argument supplies the SSL settings for the HTTPS connection;
    # this one skips certificate verification
    print(res._method)
    print(res.read())

Most sites now have CA-signed certificates, so the context parameter of urlopen is rarely needed.
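When verification does have to be switched off, the same effect can be had without the underscore-prefixed helper by configuring a default context by hand. This is a minimal sketch using only public `ssl` attributes; the function name `make_unverified_context` is hypothetical, and disabling verification removes protection against forged certificates, so it only suits sites you already trust.

```python
import ssl


def make_unverified_context():
    """Return an SSLContext that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


context = make_unverified_context()
print(context.verify_mode)
```

The resulting object can be passed as `urlopen(request, context=context)` in the same way as the context created above.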
