Learning Web Crawlers

程序员文章站 2022-03-02 19:32:37


November 4, 2015

Scraping Qiushibaike (糗事百科)

#!/usr/bin/python
# -*- coding:utf-8 -*-
# Python 2 script: fetch the first "hot" page of Qiushibaike and save the posts.
import re
import urllib2
import cPickle as P

page = 1
count = 0
temp = ''
url = "http://www.qiushibaike.com/hot/page/" + str(page)
# Send a browser-like User-Agent with the request.
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
try:
    req = urllib2.Request(url, headers=headers)
    resp = urllib2.urlopen(req)
    # Keep the response body as raw bytes (assumed to be UTF-8).
    content = resp.read()
    # The pattern follows the page's line-by-line markup, so it breaks
    # whenever the site changes its HTML.
    patterns = re.compile(r'<div.*?class="author.*".*?>\n<a.*?>\n<(.*?)>\n</a>\n.*\n<h2>(.*)</h2>\n.*\n.*\n{3}<div.*>\n{2}(.*)\n.*\n{2}.*\n{4}.*')
    items = re.findall(patterns, content)
    for item in items:
        count = count + 1
        # item[1] is the <h2> text, item[2] the line captured after the next <div>.
        temp += '(' + str(count) + ')' + str(item[1]) + '\n' + str(item[2]) + '\n' + '\n'
    print temp

    # Pickle the collected text to qiubai.txt.
    with open("qiubai.txt", "wb") as myFile:
        P.dump(temp, myFile)

    # Load it back to check the round trip.
    with open("qiubai.txt", "rb") as myFile:
        content = P.load(myFile)
    print content
except urllib2.URLError as e:
    # HTTPError carries a status code; a plain URLError only has a reason.
    if hasattr(e, 'code'):
        print e.code
    if hasattr(e, 'reason'):
        print e.reason
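
The listing above targets Python 2 (`urllib2` and `cPickle` do not exist in Python 3). As a rough sketch of the same extract, number, and pickle steps in Python 3, the block below parses an inline HTML fragment instead of the live site. The fragment `SAMPLE_HTML`, the simplified pattern `POST_RE`, and the helper names `extract_posts` and `format_posts` are all assumptions made for illustration; the real page's markup is not reproduced here and may look quite different.

```python
import io
import pickle
import re

# Hypothetical fragment for illustration; the real page's markup may differ.
SAMPLE_HTML = """<div class="author clearfix">
<a href="/users/1/">
<h2>alice</h2>
</a>
</div>
<div class="content">
first post
</div>
<div class="author clearfix">
<a href="/users/2/">
<h2>bob</h2>
</a>
</div>
<div class="content">
second post
</div>
"""

# Simplified pattern (an assumption): capture the <h2> text and the text
# inside the following content <div>. re.S lets '.' match newlines too.
POST_RE = re.compile(
    r'<div class="author.*?<h2>(.*?)</h2>.*?<div class="content">\s*(.*?)\s*</div>',
    re.S,
)


def extract_posts(html):
    """Return a list of (author, text) tuples found in html."""
    return POST_RE.findall(html)


def format_posts(posts):
    """Number each post the same way the listing builds `temp`."""
    return ''.join(
        '(%d)%s\n%s\n\n' % (i, author, text)
        for i, (author, text) in enumerate(posts, start=1)
    )


posts = extract_posts(SAMPLE_HTML)
text = format_posts(posts)

# Round-trip through pickle with an in-memory buffer instead of qiubai.txt.
buf = io.BytesIO()
pickle.dump(text, buf)
buf.seek(0)
restored = pickle.load(buf)
```

Anchoring the pattern on tags and using lazy quantifiers with `re.S`, rather than counting newlines, makes the match less sensitive to whitespace changes in the page, though any regex over HTML remains fragile.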

Categories: Python, Web Crawlers
