Python crawler example: collecting data from CSDN

程序员文章站 2022-07-08 14:00:59

Collecting the content of CSDN pages with Python is relatively easy, because CSDN does not require a login, cookies, or custom request headers.

Under Python 2.7:

#coding:utf-8
# This example fetches the titles, links, and read counts of a given user's CSDN articles
import urllib2
import re
from bs4 import BeautifulSoup
# CSDN requires no login, no cookies, and no custom headers
print('=======================CSDN data mining==========================')
urlstr="http://blog.csdn.net/luanpeng825485697?viewmode=contents"
host = "http://blog.csdn.net/luanpeng825485697"  # root URL of the blog

alllink=[urlstr]   # every URL that needs to be visited
data={}
def getdata(html,reg):  # extract values from a string using a regular expression
    pattern = re.compile(reg)
    items = re.findall(pattern, html)
    for item in items:
        urlpath = urllib2.urlparse.urljoin(urlstr,item[0])   # convert the relative URL to an absolute one
        if urlpath not in data:   # record each article only once
            data[urlpath] = item
            print urlpath,'     ',  # the trailing comma keeps print from ending the line
            print item[2], '     ',
            print item[1]



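# Collect the relevant links from a page and add them to the URL list
def getlink(url,html):
    soup = BeautifulSoup(html,'html.parser')   # uses Python's built-in html.parser, so no extra parser package is needed
    for tag in soup.find_all('a'):   # find every <a> tag in the document
        link = tag.get('href')
        newurl = urllib2.urlparse.urljoin(url, link) # absolute form of the link, resolved against the page URL
        if host not in newurl:  # skip links that point outside the blog
            continue
        if newurl in alllink:   # skip URLs that are already in the list
            continue
        if "http://blog.csdn.net/luanpeng825485697/article/list" not in newurl:  # custom restriction: keep only article-list pages
            continue
        alllink.append(newurl)   # add the URL to the list of links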


# Fetch a URL and extract the content that matches the given regular expression
def craw(url):
    try:
        request = urllib2.Request(url)  # build the request
        response = urllib2.urlopen(request)  # get the response
        html = response.read()  # read the returned HTML source
        # reg = r'"link_title">\r\nhttp://blog.csdn.net/luanpeng825485697/article/details/(.*)\n.*'  # matches only the article URL and title
        reg = r'"link_title">\r\n        http://blog.csdn.net/luanpeng825485697/article/details/(.*)            \r\n.*[\s\S]*?阅读\(http://blog.csdn.net/luanpeng825485697/article/details/(.*)\)'  # matches the URL, title, and read count
        getdata(html,reg)
        getlink(url,html)

    except urllib2.URLError as e:
        if hasattr(e,"code"):
            print e.code
        if hasattr(e,"reason"):
            print e.reason

for url in alllink:
    craw(url)
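
The final loop works because `alllink` grows while it is being iterated: every list page that `getlink` discovers is appended and then visited in turn. Below is a minimal sketch of that same idea for Python 3, under a few assumptions that are not in the original script. It uses the standard library's `html.parser.HTMLParser` in place of BeautifulSoup, it takes a hypothetical `fetch` callable in place of `urllib2.urlopen` so it can be tried without network access, and it checks the URL prefix with `startswith` rather than a substring test. The names `LinkParser`, `extract_list_links`, and `crawl` are illustrative only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HOST = "http://blog.csdn.net/luanpeng825485697"
LIST_PREFIX = HOST + "/article/list"  # only article-list pages are followed


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip <a> tags that have no href
                self.hrefs.append(href)


def extract_list_links(page_url, html):
    """Return the absolute article-list links found in html, in page order."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if absolute.startswith(LIST_PREFIX) and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch):
    """Visit start_url and every list page reachable from it, once each.

    fetch is a hypothetical callable taking a URL and returning the page's
    HTML as a string; it stands in for the real HTTP request.
    """
    queue = [start_url]
    for url in queue:  # the list grows during iteration, like alllink
        for link in extract_list_links(url, fetch(url)):
            if link not in queue:
                queue.append(link)
    return queue
```

To try it offline, pass a dictionary lookup as `fetch`, for example `crawl(start, pages.__getitem__)` where `pages` maps URLs to HTML strings. For a real run, `fetch` would wrap an HTTP request and decode the response body.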
