Python Weibo Crawler for Novel Coronavirus Pneumonia News

程序员文章站 2024-03-08 10:54:58


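First, what exactly are we scraping? We want to search Weibo for "新型冠状病毒" (novel coronavirus) and write every result into Excel. These are the items marked with blue lines in the figure below, including each post's content, time, source, like count, and so on.

As usual, a few result screenshots come first. The figure below shows the results stored in a DataFrame.

Saved to Excel, they look like this: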

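Detailed walkthrough:

The whole process has three parts:
1. Simulate a login to my Weibo account to obtain the cookies
2. Write the crawler
3. Parse the pages the crawler retrieves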

First, import the libraries this crawler needs.

# Import the packages the crawler needs
import requests
from bs4 import BeautifulSoup
import time
import re
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

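1. Simulate a login to my Weibo account to obtain the cookies

Drive a real browser and log in with an account and password. Use your own account and password here, because I have changed the ones shown.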

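# Simulate a browser login
from selenium.webdriver.common.by import By  # the find_element_by_* helpers were removed in Selenium 4.3

chrome_options = Options()
b = webdriver.Chrome(options=chrome_options)
b.get('https://weibo.com/')
time.sleep(10)  # wait for the page to load; otherwise the element lookups fail
user = b.find_element(By.ID, 'loginname')
pwd = b.find_element(By.NAME, 'password')
user.send_keys('YOUR_ACCOUNT')  # account: replace with your own
pwd.send_keys('YOUR_PASSWORD')  # password: replace with your own
pwd.send_keys(Keys.ENTER)  # press Enter to submit
time.sleep(10)  # give the login time to finish before the cookies are read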

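Get the cookies

cookie = b.get_cookies()  # get the cookies; every later request needs them
cookie
str1 = ""
# Join the cookies into a single string. They could be copied from the browser
# by hand, but the point of the simulated login is to fetch them automatically
# (they are only valid for a limited time).
for i in cookie:
    str1 += i["name"]+"="+i["value"]+"; "
str1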
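
The loop above can also be wrapped in a small helper so the same logic works on any list of cookie dicts. This is only a sketch: the name `cookies_to_header` and the sample cookie values are made up for illustration, and it assumes each cookie is a dict with `name` and `value` keys, which is what Selenium's `get_cookies()` returns.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header string."""
    # Each cookie is assumed to be a dict with at least "name" and "value".
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Hypothetical cookies, standing in for the result of b.get_cookies()
sample_cookies = [
    {"name": "SUB", "value": "abc123"},
    {"name": "SUBP", "value": "xyz789"},
]
print(cookies_to_header(sample_cookies))
```

Unlike the loop, this version leaves no trailing "; " at the end of the string. Either form should be acceptable as a Cookie header value.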

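The resulting str1 is the cookie string we use from here on, as shown below.

Next we write the crawler function that fetches the pages. Type "新型冠状病毒" into the Weibo search box and click search. That brings up the page below. Note its link, https://s.weibo.com/weibo/%25E6%2596%25B0%25E5%259E%258B%25E5%2586%25A0%25E7%258A%25B6%25E7%2597%2585%25E6%25AF%2592?topnav=1&wvr=6&page=1, which is the URL of the first page of search results.

Checking the total number of pages, we find there are 50 (hmm, that feels a little low).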

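So how do we get the content of all these pages? On the first page, right-click, choose Inspect, open the Network tab, and refresh the page. Find the request that returned the first page and look at its parameters: as the figure below shows, its page is 1.

Click through to the next page and check it the same way: its page is 2, as shown below.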
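
The same comparison can be done in code rather than by eye. The sketch below uses only the standard library; the helper names `query_params` and `differing_params` are made up for illustration, and the page-2 URL is assumed to be the page-1 URL with only `page` changed, as observed in the Network tab.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Return the query-string parameters of a URL as a flat dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parsed.items()}


def differing_params(url_a, url_b):
    """Return the names of parameters whose values differ between two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


base = ('https://s.weibo.com/weibo/%25E6%2596%25B0%25E5%259E%258B%25E5%2586%25A0'
        '%25E7%258A%25B6%25E7%2597%2585%25E6%25AF%2592?topnav=1&wvr=6&page=')
page1 = base + '1'
page2 = base + '2'
print(differing_params(page1, page2))  # prints ['page']
```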

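Apart from page, all the other parameters are identical, so we only need to change page in the first-page link (https://s.weibo.com/weibo/%25E6%2596%25B0%25E5%259E%258B%25E5%2586%25A0%25E7%258A%25B6%25E7%2597%2585%25E6%25AF%2592?topnav=1&wvr=6&page=1). We write the following function to fetch and parse a single page: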

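def get_content(page):
    """Fetch one page of search results; return a BeautifulSoup object, or None on failure."""
    base_url = 'https://s.weibo.com/weibo/%25E6%2596%25B0%25E5%259E%258B%25E5%2586%25A0%25E7%258A%25B6%25E7%2597%2585%25E6%25AF%2592?topnav=1&wvr=6&page='
    url = base_url + str(page)
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        # 'br' is left out: requests can only decode Brotli when a brotli package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': str1,  # cookie string built from the simulated login
        'Host': 's.weibo.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
    }
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()  # treat HTTP error codes as failures
        soup = BeautifulSoup(r.text, "html.parser")
        return soup
    except requests.RequestException as e:
        print('Request failed for page {}: {}'.format(page, e))
        return None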
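
The long hard-coded base_url is simply the search keyword percent-encoded twice, which is why every byte shows up as %25E6 rather than %E6. Here is a minimal sketch that rebuilds the same URL from the keyword. The name `build_search_url` is hypothetical, and whether Weibo accepts other keywords in exactly this form is an assumption; only the "新型冠状病毒" URL was observed.

```python
from urllib.parse import quote


def build_search_url(keyword, page):
    """Build a Weibo search URL for `keyword` and a result page number."""
    # Encode twice: the first pass gives "%E6%96%B0...", the second turns
    # every "%" into "%25", matching the URL copied from the browser.
    encoded = quote(quote(keyword))
    return 'https://s.weibo.com/weibo/{}?topnav=1&wvr=6&page={}'.format(encoded, page)


print(build_search_url('新型冠状病毒', 1))
```

With a helper like this, get_content could take the keyword as a parameter instead of being tied to a single search term.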

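Once we have each page, we parse it to extract what we need. What do we want? For every post: its content, its time and source, and its favorite, comment, repost, and like counts.

The div with class card holds the content and the time and source; the div with class card-act holds the favorite, comment, repost, and like counts.
Here is the code: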

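dic_all = {}
n = 0
for i in range(1, 51):  # result pages are numbered 1 to 50
    soup = get_content(i)
    if soup is None:  # skip pages that failed to download
        continue
    for div in soup.find_all('div', {'class': "card"}):
        dic = {}
        # '内容' = post content
        for content in div.find_all('p', {'node-type': "feed_list_content"}):
            if '展开全文' in content.text:  # truncated post ("expand full text")
                # the complete text sits in a separate hidden node
                for full in div.find_all('p', {'node-type': "feed_list_content_full"}):
                    text = re.sub(r'\s+', '', full.text)
                    # drop the trailing "收起全文" (collapse) link text
                    dic['内容'] = re.sub(r'收起全文d?$', '', text)
            else:
                dic['内容'] = re.sub(r'\s+', '', content.text)
        # '时间和来源' = time and source; the loop variable is named `source`
        # so that it does not shadow the time module
        for source in div.find_all('p', {'class': "from"}):
            dic['时间和来源'] = re.sub(r'\s+', '', source.text)
        # counts: '收藏' favorites, '转发' reposts, '评论' comments, '点赞' likes
        for other in div.find_all('div', {'class': "card-act"}):
            string = other.text.strip().split('\n')
            if string[0] != '收藏':
                dic['收藏'] = string[0][3:]
            else:
                dic['收藏'] = 0
            if string[1].strip() != '转发':
                dic['转发'] = string[1].strip()[3:]
            else:
                dic['转发'] = 0
            if string[2].strip() != '评论':
                dic['评论'] = string[2].strip()[3:]
            else:
                dic['评论'] = 0
            if len(string) == 4:
                dic['点赞'] = string[3].strip()
            else:
                dic['点赞'] = 0
        dic_all[n] = dic
        n = n + 1
        print('{} records crawled'.format(n))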
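
The four if/else branches for the counts all follow the same pattern, so they can be factored into one helper that also converts the counts to integers. This is a sketch under assumptions: the names `parse_count` and `parse_card_act` are made up, and the sample text only imitates the card-act layout implied by the slicing above (label, optional number, one field per line, with the like count as a bare number on the fourth line).

```python
def parse_count(field, label):
    """Return the number that follows `label` in `field`, or 0 if none is shown."""
    rest = field.strip()
    if rest.startswith(label):
        rest = rest[len(label):]
    rest = rest.strip()
    # Anything that is not a plain number is treated as 0.
    return int(rest) if rest.isdigit() else 0


def parse_card_act(text):
    """Split the text of a card-act div into the four counts."""
    parts = text.strip().split('\n')
    parts += [''] * (4 - len(parts))  # pad so missing fields read as empty
    return {
        '收藏': parse_count(parts[0], '收藏'),  # favorites
        '转发': parse_count(parts[1], '转发'),  # reposts
        '评论': parse_count(parts[2], '评论'),  # comments
        '点赞': parse_count(parts[3], ''),      # likes (no label, number only)
    }


# Hypothetical card-act text, imitating other.text in the loop above
sample = '收藏\n 转发 12\n 评论 3\n 45'
print(parse_card_act(sample))
```

Padding the list also avoids an IndexError when a card has fewer than three fields.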

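One thing to note is how posts containing "展开全文" (expand full text) are handled. Looking at the page source, we find that such a post has a second p node that holds the complete text, but its display is set to none, so it is not shown.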
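
When cleaning the full text it is safer to remove the trailing "收起全文" (collapse) marker with an anchored pattern than with str.strip(), because strip() removes any of the given characters from both ends of the string. A minimal sketch, with `clean_text` as a made-up name and the sample sentence invented for illustration:

```python
import re


def clean_text(text):
    """Remove all whitespace and a trailing '收起全文d' marker from post text."""
    text = re.sub(r'\s+', '', text)
    # Anchor at the end so characters at the start of the post are untouched.
    return re.sub(r'收起全文d?$', '', text)


raw = '全国新增病例 \n 收起全文d'
print(clean_text(raw))
# For comparison: strip() also eats the leading '全' of the post itself
print(re.sub(r'\s+', '', raw).strip('收起全文d'))
```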
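At this point we have essentially all of the data, as shown below.

We convert it to a DataFrame, transpose it, drop rows with missing values, and save it to Excel.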

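data = pd.DataFrame(dic_all).T.dropna()  # one row per post; drop incomplete rows
data.to_excel(r'C:\Users\weibo.xlsx')  # .xlsx needs openpyxl; pandas 2.x can no longer write .xls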
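
If no Excel writer is installed, the same step can be approximated with the standard library by writing a CSV file instead. This is an alternative technique, not what the code above does: the names `complete_records` and `write_csv` are hypothetical, the sample record is invented, and dropping incomplete records only imitates what dropna() does for missing keys.

```python
import csv
import io

FIELDS = ['内容', '时间和来源', '收藏', '转发', '评论', '点赞']


def complete_records(records):
    """Keep only records that have every field, similar to dropna()."""
    return [r for r in records.values() if all(f in r for f in FIELDS)]


def write_csv(records, fileobj):
    """Write the complete records to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(complete_records(records))


# Hypothetical data in the same shape as dic_all; record 1 is incomplete
sample_records = {
    0: {'内容': '示例微博', '时间和来源': '今天12:00来自微博weibo.com',
        '收藏': 0, '转发': 12, '评论': 3, '点赞': 45},
    1: {},
}
buf = io.StringIO()
write_csv(sample_records, buf)
print(buf.getvalue())
```

To write a real file, open it with open(path, 'w', newline='', encoding='utf-8-sig') so that Excel displays the Chinese column names correctly.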

The resulting data is shown below.
If you would like the source code, message me in the comments and I will send it to you.
