Scraping Weibo

程序员文章站 2024-03-25 13:34:34

Ajax (Asynchronous JavaScript and XML)
In the Request Headers, x-requested-with: XMLHttpRequest marks a request as an Ajax request.

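Analysis:

  1. Disable JavaScript in the browser to confirm that the posts are loaded by scripts rather than contained in the initial HTML.
  2. In the Request Headers, look for x-requested-with: XMLHttpRequest, which marks the request as an Ajax request.
  3. Filter the network requests by XHR and inspect the responses; the content is in JSON format.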
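
Steps 2 and 3 can also be checked in code. The following is a minimal sketch using only the standard library; the helper names is_ajax_request and looks_like_json are hypothetical and chosen for illustration, and the sample inputs in the usage lines are invented.

```python
import json


def is_ajax_request(headers):
    """Return True if the headers carry X-Requested-With: XMLHttpRequest.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    for name, value in headers.items():
        if name.lower() == 'x-requested-with':
            return value.strip().lower() == 'xmlhttprequest'
    return False


def looks_like_json(body):
    """Return True if the response body (a string) parses as JSON."""
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Invented sample inputs, for illustration only.
print(is_ajax_request({'x-requested-with': 'XMLHttpRequest'}))
print(looks_like_json('{"ok": 1}'))
```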

  1. The request method is GET, and the three parameters type, value and containerid are fixed.
from urllib.parse import urlencode

import requests
from pymongo import MongoClient
from pyquery import PyQuery

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Referer': 'https://m.weibo.cn/u/2556696984',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                  '78.0.3904.70 Mobile Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

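def get_page():
    """Request the user's post list and return the parsed JSON, or None on failure."""
    params = {
        'type': 'uid',
        'value': '2556696984',
        'containerid': '1076032556696984',
    }
    url = base_url + urlencode(params)

    try:
        # The timeout keeps the script from hanging on a stalled connection.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error', e.args)
    return None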

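def parse_page(data):
    """Yield one dict per post found in the response JSON."""
    if not data:
        return
    cards = data.get('data', {}).get('cards', [])

    for card in cards:
        # Only cards that carry an 'mblog' entry are posts.
        mblog = card.get('mblog')
        if mblog:
            yield {
                # 'scheme' is read from the card that holds this post.
                'link': card.get('scheme'),
                # 'text' is an HTML fragment; PyQuery strips the tags.
                'text': PyQuery(mblog.get('text')).text(),
                'source': mblog.get('source'),
                'likes': mblog.get('attitudes_count'),
                'comments': mblog.get('comments_count'),
                'reposts': mblog.get('reposts_count'),
            }


def save_to_mongo(collection, result):
    # insert_one returns a result whose 'acknowledged' flag confirms the write.
    if collection.insert_one(result).acknowledged:
        print('SAVE SUCCESS!')


if __name__ == '__main__':
    # Connect before iterating so that every parsed post is saved.
    client = MongoClient()
    collection = client['weibo']['weibo1']

    for result in parse_page(get_page()):
        print(result)
        save_to_mongo(collection, result)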
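
The parsing step above relies on a particular response shape: data, then cards, where a card may carry a scheme link and an mblog dict. The sketch below walks that shape without a network request, so the traversal can be tried offline. The sample payload is invented for illustration (including the example.com links), and iter_posts is a hypothetical helper name, not part of the original script.

```python
# Invented payload that mimics the assumed shape: data -> cards -> mblog.
sample = {
    'data': {
        'cards': [
            {
                'scheme': 'https://example.com/status/1',
                'mblog': {
                    'text': 'hello',
                    'source': 'iPhone',
                    'attitudes_count': 3,
                    'comments_count': 1,
                    'reposts_count': 0,
                },
            },
            {'card_type': 11},  # no 'mblog' entry, so this card is skipped
        ]
    }
}


def iter_posts(payload):
    """Yield (link, mblog) pairs, pairing each post with its own card's link."""
    cards = (payload or {}).get('data', {}).get('cards', [])
    for card in cards:
        mblog = card.get('mblog')
        if mblog:
            yield card.get('scheme'), mblog


posts = list(iter_posts(sample))
for link, mblog in posts:
    print(link, mblog['attitudes_count'])
```

Reading the link from the same card as the post keeps each link matched to its text, and the empty defaults let a missing or empty response produce no results instead of an exception.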