Scraping Sina Weibo with Python: help needed!

程序员文章站 (Programmer Article Site) 2022-04-27 20:13:45


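I'm scraping Sina Weibo with Python and getting blocked. I'm using proxies, with 10 accounts and 10 proxies, but the crawl is very slow. Does anyone have a better approach? Thanks!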

Replies:

http://github.com/zhu327/rss Since you're using Python too, just read the code.

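Scrape this: http://service.weibo.com/widget/widget_blog.php?uid={uid} (replace uid). No login is required and it won't get blocked.

Scrape the mobile site:
http://weibo.cn
You can refer to the code below, from Jikexueyuan (极客学院); it will be removed if it infringes.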

# -*- coding: utf-8 -*-

import os
import smtplib
import time
from email.mime.text import MIMEText

import requests
from lxml import etree


class mailhelper(object):
    '''
    Sends notification e-mails.
    '''
    def __init__(self):
        self.mail_host = "smtp.xxxx.com"  # SMTP server
        self.mail_user = "xxxx"           # user name
        self.mail_pass = "xxxx"           # password
        self.mail_postfix = "xxxx.com"    # domain part of the sender address

    def send_mail(self, to_list, sub, content):
        # Assumption: the listing lost the address part of the sender, so it is
        # rebuilt here from mail_user and mail_postfix as "xxoohelper<user@domain>".
        me = "xxoohelper" + "<" + self.mail_user + "@" + self.mail_postfix + ">"
        msg = MIMEText(content, _subtype='plain', _charset='utf-8')
        msg['Subject'] = sub
        msg['From'] = me
        msg['To'] = ", ".join(to_list)
        try:
            server = smtplib.SMTP()
            server.connect(self.mail_host)
            server.login(self.mail_user, self.mail_pass)
            server.sendmail(me, to_list, msg.as_string())
            server.close()
            return True
        except Exception as e:
            print(str(e))
            return False


class xxoohelper(object):
    '''
    Fetches the first post on a Weibo page.
    '''
    def __init__(self):
        self.url = 'http://weibo.cn/u/xxxxxxx'  # Weibo page to scrape
        self.url_login = 'https://login.weibo.cn/login/'
        self.new_url = self.url_login

    def getSource(self):
        # The response is expected to be the login form, which getData() parses.
        html = requests.get(self.url).content
        return html

    def getData(self, html):
        # The password field name, the vk token and the form action are read
        # from the login page rather than hard-coded.
        selector = etree.HTML(html)
        password = selector.xpath('//input[@type="password"]/@name')[0]
        vk = selector.xpath('//input[@name="vk"]/@value')[0]
        action = selector.xpath('//form[@method="post"]/@action')[0]
        self.new_url = self.url_login + action
        data = {
            'mobile': 'xxxxx@xxx.com',
            password: 'xxxxxx',
            'remember': 'on',
            'backURL': 'http://weibo.cn/u/xxxxxx',  # change this to the Weibo page URL
            'backTitle': '微博',
            'tryCount': '',
            'vk': vk,
            'submit': '登录'
        }
        return data

    def getContent(self, data):
        # Submit the login form and parse the page that comes back.
        newhtml = requests.post(self.new_url, data=data).content
        new_selector = etree.HTML(newhtml)
        content = new_selector.xpath('//span[@class="ctt"]')
        # The original takes the third "ctt" span as the first post.
        newcontent = str(content[2].xpath('string(.)')).replace('http://', '')
        sendtime = new_selector.xpath('//span[@class="ct"]/text()')[0]
        sendtext = newcontent + sendtime
        return sendtext

    def tosave(self, text):
        # Append the post to the local history file.
        with open('weibo.txt', 'a', encoding='utf-8') as f:
            f.write(text + '\n')

    def tocheck(self, data):
        # True if this post has not been saved yet.
        if not os.path.exists('weibo.txt'):
            return True
        with open('weibo.txt', 'r', encoding='utf-8') as f:
            existweibo = f.readlines()
        return data + '\n' not in existweibo


if __name__ == '__main__':
    mailto_list = ['xxxxx@qq.com']  # recipient address(es)
    helper = xxoohelper()
    while True:
        source = helper.getSource()
        data = helper.getData(source)
        content = helper.getContent(data)
        if helper.tocheck(content):
            if mailhelper().send_mail(mailto_list, "Goddess has a new post", content):
                print("sent")
            else:
                print("send failed")
            helper.tosave(content)
            print(content)
        else:
            print('pass')
        time.sleep(30)
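
The tosave/tocheck pair above re-reads weibo.txt on every poll. Below is a minimal sketch of the same "have I seen this post already" check that loads the file once and then keeps the entries in memory. The class name SeenStore and its methods are hypothetical, not part of the quoted code, and like the original it assumes each saved post fits on one line.

```python
import os
import tempfile


class SeenStore:
    """Hypothetical helper: remembers which posts were already handled."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        # Load earlier entries once instead of re-reading the file on every poll.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen = {line.rstrip("\n") for line in f}

    def is_new(self, text):
        # True if this text has not been stored yet.
        return text not in self.seen

    def add(self, text):
        # Remember the text in memory and append it to the history file.
        self.seen.add(text)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(text + "\n")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weibo.txt")
    store = SeenStore(path)
    print(store.is_new("first post"))
    store.add("first post")
    print(store.is_new("first post"))
```

In the polling loop this would stand in for helper.tocheck(content) and helper.tosave(content); whether that is worth it depends on how large the history file gets.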

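I've heard that scraping the mobile version works surprisingly well. I scraped it a while back, but I don't know whether it still works.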

Scrape its mobile pages; at the time they had fewer restrictions than the desktop site.
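
Both page-level suggestions in this thread (the no-login widget URL and the mobile site) come down to requesting a different URL per uid. A minimal sketch using only the standard library follows; the two URL patterns are the ones quoted in the thread, while the function names, the timeout and the User-Agent header are assumptions for illustration. Whether either endpoint still responds this way is not something this sketch can confirm.

```python
import urllib.request

# URL patterns quoted in the thread.
WIDGET_URL = "http://service.weibo.com/widget/widget_blog.php?uid={uid}"
MOBILE_URL = "http://weibo.cn/u/{uid}"


def widget_url(uid):
    """Widget page for one user (per the reply, no login needed)."""
    return WIDGET_URL.format(uid=uid)


def mobile_url(uid):
    """Mobile-site page for one user."""
    return MOBILE_URL.format(uid=uid)


def fetch(url, timeout=10):
    """Download one page and return the raw bytes.

    The User-Agent value is an assumption, not something the thread specifies.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    # Only builds the URLs; call fetch(...) to actually hit the network.
    print(widget_url("1234567"))
    print(mobile_url("1234567"))
```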

Deploy the crawler across multiple nodes on Google App Engine.

Sina has a developer platform with dedicated API endpoints; a scraper will get blocked.
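
For the setup in the question (10 accounts, 10 proxies, slow crawl), the multi-node idea can be approximated on one machine by running the account/proxy pairs concurrently. The sketch below only shows the scheduling: pair_up, assign, crawl_all and fake_fetch are hypothetical names, and fetch_one stands in for whatever function performs the real login and request. Pinning each account to one proxy is a design assumption, not something the thread prescribes, and the sketch does not throttle requests per pair.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def pair_up(accounts, proxies):
    """Tie each account to one proxy, giving a list of (account, proxy) pairs."""
    return list(zip(accounts, proxies))


def assign(uids, identities):
    """Spread target uids over the (account, proxy) pairs round-robin."""
    return [(uid, identity) for uid, identity in zip(uids, cycle(identities))]


def crawl_all(uids, identities, fetch_one, workers=None):
    """Call fetch_one(uid, account, proxy) concurrently.

    Results come back in the same order as uids.
    """
    jobs = assign(uids, identities)
    max_workers = workers or len(identities) or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: fetch_one(job[0], *job[1]), jobs))


if __name__ == "__main__":
    accounts = ["account%d" % i for i in range(10)]            # placeholders
    proxies = ["http://proxy%d:8080" % i for i in range(10)]   # placeholders

    def fake_fetch(uid, account, proxy):
        # Stand-in for the real request; reports which pair handled the uid.
        return "%s via %s / %s" % (uid, account, proxy)

    for line in crawl_all(range(25), pair_up(accounts, proxies), fake_fetch):
        print(line)
```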

