怎么用Python写爬虫抓取网页数据
机器学习首先面临的一个问题就是准备数据,数据的来源大概有这么几种:公司积累数据,购买,交换,政府机构及企业公开的数据,通过爬虫从网上抓取。本篇介绍怎么写一个爬虫从网上抓取公开的数据。
很多语言都可以写爬虫,但是不同语言的难易程度不同,python作为一种解释型的胶水语言,上手简单、入门容易,标准库齐全,还有丰富的各种开源库,语言本身提供了很多提高开发效率的语法糖,开发效率高,总之“人生苦短,快用python”(life is short, you need python!)。在web网站开发,科学计算,数据挖掘/分析,人工智能等很多领域广泛使用。
开发环境配置,python3.5.2,scrapy1.2.1,使用pip安装scrapy,命令:pip3 install scrapy,此命令在mac下会自动安装scrapy的依赖包,安装过程中如果出现网络超时,多试几次。
创建工程
首先创建一个scrapy工程,工程名为:kiwi,命令:scrapy startproject kiwi,将创建一些文件夹和文件模板。
定义数据结构
settings.py是一些设置信息,items.py用来保存解析出来的数据,在此文件里定义一些数据结构,示例代码:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    """Author of a topic or reply."""
    authorName = scrapy.Field()  # author nickname
    authorUrl = scrapy.Field()   # author profile URL


class ReplyItem(scrapy.Item):
    """One reply under a topic."""
    content = scrapy.Field()  # reply content
    time = scrapy.Field()     # publish time
    author = scrapy.Field()   # replier (dict built from AuthorInfo)


class TopicItem(scrapy.Item):
    """One discussion topic plus its replies."""
    title = scrapy.Field()       # topic title
    url = scrapy.Field()         # topic page URL
    content = scrapy.Field()     # topic body text
    time = scrapy.Field()        # publish time
    author = scrapy.Field()      # poster (dict built from AuthorInfo)
    reply = scrapy.Field()       # reply list (list of ReplyItem dicts)
    replyCount = scrapy.Field()  # number of replies
上面TopicItem中嵌套了AuthorInfo和ReplyItem list,但是初始化类型必须是scrapy.Field(),注意这三个类都需要从scrapy.Item继承。
创建爬虫蜘蛛
工程目录spiders下的kiwi_spider.py文件是爬虫蜘蛛代码,爬虫代码写在这个文件里。示例以爬豆瓣群组里的帖子和回复为例。
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem


class KiwiSpider(CrawlSpider):
    """Crawl Douban group discussion lists and topic pages.

    Must inherit from CrawlSpider (not Spider) so that `rules` are followed.
    """
    name = "kiwi"
    allowed_domains = ["douban.com"]

    # Relative XPaths used to pull the text / href out of an <a> node.
    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        "https://www.douban.com/group/topic/90895393/?start=0",
    ]
    rules = (
        Rule(
            LinkExtractor(allow=(r'/group/[^/]+/discussion\?start=\d+',)),
            callback='parse_topic_list',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/$',)),  # topic detail page
            callback='parse_topic_content',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/\?start=\d+',)),  # paginated topic detail page
            callback='parse_topic_content',
            follow=True
        ),
    )

    # topic detail page
    def parse_topic_content(self, response):
        """Parse one topic page: title, body, time, poster and all replies."""
        # XPath for the page title
        titleXPath = '//html/head/title/text()'
        # XPath for the topic body paragraphs
        contentXPath = '//div[@class="topic-content"]/p/text()'
        # XPath for the publish time
        timeXPath = '//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()'
        # XPath for the poster anchor container
        authorXPath = '//div[@class="topic-doc"]/h3/span[@class="from"]'

        item = TopicItem()
        # current page URL
        item['url'] = response.url
        # title
        titleFragment = Selector(response).xpath(titleXPath)
        item['title'] = str(titleFragment.extract()[0]).strip()

        # topic body: join the paragraph text nodes
        contentFragment = Selector(response).xpath(contentXPath)
        strs = [line.extract().strip() for line in contentFragment]
        item['content'] = '\n'.join(strs)
        # publish time (may be absent)
        timeFragment = Selector(response).xpath(timeXPath)
        if timeFragment:
            item['time'] = timeFragment[0].extract()

        # poster info
        authorInfo = AuthorInfo()
        authorFragment = Selector(response).xpath(authorXPath)
        if authorFragment:
            authorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
            authorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

        # nested items must be wrapped in dict() before assignment
        item['author'] = dict(authorInfo)

        # XPath for each reply block
        replyRootXPath = r'//div[@class="reply-doc content"]'
        # XPath (relative to a reply block) for the reply time
        replyTimeXPath = r'div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()'
        # XPath (relative to a reply block) for the replier anchor container
        replyAuthorXPath = r'div[@class="bg-img-green"]/h4'

        replies = []
        itemsFragment = Selector(response).xpath(replyRootXPath)
        for replyItemXPath in itemsFragment:
            replyItem = ReplyItem()
            # reply content
            contents = replyItemXPath.xpath('p/text()')
            strs = [line.extract().strip() for line in contents]
            replyItem['content'] = '\n'.join(strs)
            # reply time (may be absent)
            timeFragment = replyItemXPath.xpath(replyTimeXPath)
            if timeFragment:
                replyItem['time'] = timeFragment[0].extract()
            # replier
            replyAuthorInfo = AuthorInfo()
            authorFragment = replyItemXPath.xpath(replyAuthorXPath)
            if authorFragment:
                replyAuthorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
                replyAuthorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

            replyItem['author'] = dict(replyAuthorInfo)
            # append to the reply list
            replies.append(dict(replyItem))

        item['reply'] = replies
        yield item

    # topic list page
    def parse_topic_list(self, response):
        """Parse a discussion list page, yielding one TopicItem per table row."""
        # XPath for topic rows (skip the header row)
        topicRootXPath = r'//table[@class="olt"]/tr[position()>1]'
        # XPath for the title cell
        titleXPath = r'td[@class="title"]'
        # XPath for the poster cell
        authorXPath = r'td[2]'
        # XPath for the reply-count cell
        replyCountXPath = r'td[3]/text()'
        # XPath for the post-time cell
        timeXPath = r'td[@class="time"]/text()'

        topicsPath = Selector(response).xpath(topicRootXPath)
        for topicItemPath in topicsPath:
            item = TopicItem()
            titlePath = topicItemPath.xpath(titleXPath)
            item['title'] = titlePath.xpath(self.anchorTitleXPath).extract()[0]
            item['url'] = titlePath.xpath(self.anchorHrefXPath).extract()[0]
            # post time (may be absent)
            timePath = topicItemPath.xpath(timeXPath)
            if timePath:
                item['time'] = timePath[0].extract()
            # poster
            authorPath = topicItemPath.xpath(authorXPath)
            authInfo = AuthorInfo()
            authInfo['authorName'] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]
            authInfo['authorUrl'] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]
            item['author'] = dict(authInfo)
            # reply count
            replyCountPath = topicItemPath.xpath(replyCountXPath)
            item['replyCount'] = replyCountPath[0].extract()

            # list rows carry no body text
            item['content'] = ''
            yield item

    # The start URL is a topic detail page, so CrawlSpider's parse() entry
    # point is routed to parse_topic_content.
    parse_start_url = parse_topic_content
特别注意
1、KiwiSpider需要改成从CrawlSpider类继承,模板生成的代码是从Spider继承的,那样的话不会去爬rules里的页面。
2、parse_start_url = parse_topic_content 是定义入口函数(示例的start_urls是帖子详情页),从CrawlSpider类的代码里可以看到parse函数回调的是parse_start_url函数,子类可以重写这个函数,也可以像上面代码那样给它赋值一个新函数。
3、start_urls里是入口网址,可以添加多个网址。
4、rules里定义在抓取到的网页中哪些网址需要进去爬,规则和对应的回调函数,规则用正则表达式写。上面的示例代码,定义了继续抓取帖子详情首页及分页。
5、注意代码里用dict()包装的部分,items.py文件里定义数据结构的时候,author属性实际需要的是AuthorInfo类型,赋值的时候必须用dict包装起来,item['author'] = authInfo 这样直接赋值会报错。
6、提取内容的时候利用xpath取出需要的内容,有关xpath的资料参看:xpath教程 http://www.w3school.com.cn/xpath/。开发过程中可以利用浏览器提供的工具查看xpath,比如firefox 浏览器中的firebug、firepath插件,对于https://www.douban.com/group/python/discussion?start=0这个页面,xpath规则“//td[@class="title"]”可以获取到帖子标题列表,示例:
上图红框中可以输入xpath规则,方便测试xpath的规则是否符合要求。新版firefox可以安装 try xpath 这个插件 查看xpath,chrome浏览器可以安装 xpath helper 插件。
使用随机useragent
为了让网站看来更像是正常的浏览器访问,可以写一个middleware提供随机的user-agent,在工程根目录下添加文件useragentmiddleware.py,示例代码:
# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random UA; setdefault keeps an explicitly set header intact.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # For more user agent strings, see
    # http://www.useragentstring.com/pages/useragentstring.php
    # NOTE: the original listing was missing a comma after the first entry,
    # which silently concatenated the first two strings into one bogus UA.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
修改settings.py,添加下面的设置,
DOWNLOADER_MIDDLEWARES = { 'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1, }
同时禁用cookie,COOKIES_ENABLED = False。
运行爬虫
切换到工程根目录,输入命令:scrapy crawl kiwi,console窗口可以看到打印出来的数据,或者使用命令“scrapy crawl kiwi -o result.json -t json”将结果保存到文件里。
怎么抓取用js代码动态输出的网页数据
上面的例子对由执行js代码输出数据的页面不适用,好在python的工具库多,可以安装phantomjs这个工具,从官网下载解压即可。下面以抓取 http://www.kjj.com/index_kfjj.html 这个网页的基金净值数据为例,这个页面的数据是由js代码动态输出的,js代码执行之后才会输出基金净值列表。fund_spider.py代码
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from datetime import datetime
from selenium import webdriver
from fundequity import FundEquity


class PageSpider(object):
    """Render a JS-driven page with PhantomJS and parse the fund NAV table."""

    def __init__(self):
        # Path to the PhantomJS executable — adjust for your environment.
        phantomjsPath = "/Library/Frameworks/Python.framework/Versions/3.5/phantomjs/bin/phantomjs"
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        cap["phantomjs.page.settings.loadImages"] = False
        cap["phantomjs.page.settings.disk-cache"] = False
        self.driver = webdriver.PhantomJS(executable_path=phantomjsPath, desired_capabilities=cap)

    def fetchPage(self, url):
        """Load *url* in PhantomJS and return the rendered HTML source."""
        self.driver.get(url)
        html = self.driver.page_source
        return html

    def parse(self, html):
        """Yield one FundEquity per data row of the rendered fund table."""
        # Table rows, skipping the header row.
        fundListXPath = r'//div[@id="maininfo_all"]/table[@id="ilist"]/tbody/tr[position()>1]'
        itemsFragment = Selector(text=html).xpath(fundListXPath)
        for itemXPath in itemsFragment:
            attrXPath = itemXPath.xpath(r'td[1]/text()')
            text = attrXPath[0].extract().strip()
            # "-" marks a non-data separator row — skip it.
            if text != "-":
                fe = FundEquity()
                fe.serial = text

                attrXPath = itemXPath.xpath(r'td[2]/text()')
                text = attrXPath[0].extract().strip()
                # "%Y" (4-digit year) — site dates look like 2017-01-20.
                fe.date = datetime.strptime(text, "%Y-%m-%d")

                attrXPath = itemXPath.xpath(r'td[3]/text()')
                text = attrXPath[0].extract().strip()
                fe.code = text

                attrXPath = itemXPath.xpath(r'td[4]/a/text()')
                text = attrXPath[0].extract().strip()
                fe.name = text

                attrXPath = itemXPath.xpath(r'td[5]/text()')
                text = attrXPath[0].extract().strip()
                fe.equity = text

                attrXPath = itemXPath.xpath(r'td[6]/text()')
                text = attrXPath[0].extract().strip()
                fe.accumulationEquity = text

                attrXPath = itemXPath.xpath(r'td[7]/font/text()')
                text = attrXPath[0].extract().strip()
                fe.increment = text

                attrXPath = itemXPath.xpath(r'td[8]/font/text()')
                text = attrXPath[0].extract().strip().strip('%')
                fe.growthRate = text

                # Purchase column may be empty.
                attrXPath = itemXPath.xpath(r'td[9]/a/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    fe.canBuy = (text == "购买")

                # Redemption column may be empty.
                attrXPath = itemXPath.xpath(r'td[10]/font/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    fe.canRedeem = (text == "赎回")

                yield fe

    def __del__(self):
        # Make sure the PhantomJS process is shut down.
        self.driver.quit()


def test():
    spider = PageSpider()
    html = spider.fetchPage("http://www.kjj.com/index_kfjj.html")
    for item in spider.parse(html):
        print(item)
    del spider


if __name__ == "__main__":
    test()
# -*- coding: utf-8 -*-
from datetime import date


# Fund net-asset-value record.
class FundEquity(object):
    def __init__(self):
        # instance attributes (name-mangled privates, exposed via properties)
        self.__serial = 0                 # serial number
        self.__date = None                # NAV date
        self.__code = ""                  # fund code
        self.__name = ""                  # fund name
        self.__equity = 0.0               # unit net value
        self.__accumulationEquity = 0.0   # accumulated net value
        self.__increment = 0.0            # growth value
        self.__growthRate = 0.0           # growth rate (percent)
        self.__canBuy = False             # whether the fund can be bought
        self.__canRedeem = True           # whether the fund can be redeemed

    @property
    def serial(self):
        return self.__serial

    @serial.setter
    def serial(self, value):
        self.__serial = value

    @property
    def date(self):
        return self.__date

    @date.setter
    def date(self, value):
        # validate: only date (or datetime, a date subclass) is accepted
        if not isinstance(value, date):
            raise ValueError('date must be date type!')
        self.__date = value

    @property
    def code(self):
        return self.__code

    @code.setter
    def code(self, value):
        self.__code = value

    @property
    def name(self):
        return self.__name

    @name.setter
    def name(self, value):
        self.__name = value

    @property
    def equity(self):
        return self.__equity

    @equity.setter
    def equity(self, value):
        self.__equity = value

    @property
    def accumulationEquity(self):
        return self.__accumulationEquity

    @accumulationEquity.setter
    def accumulationEquity(self, value):
        self.__accumulationEquity = value

    @property
    def increment(self):
        return self.__increment

    @increment.setter
    def increment(self, value):
        self.__increment = value

    @property
    def growthRate(self):
        return self.__growthRate

    @growthRate.setter
    def growthRate(self, value):
        self.__growthRate = value

    @property
    def canBuy(self):
        return self.__canBuy

    @canBuy.setter
    def canBuy(self, value):
        self.__canBuy = value

    @property
    def canRedeem(self):
        return self.__canRedeem

    @canRedeem.setter
    def canRedeem(self, value):
        self.__canRedeem = value

    # similar to toString() in other languages
    def __str__(self):
        return ('[serial:%s,date:%s,code:%s,name:%s,equity:%.4f,'
                'accumulationEquity:%.4f,increment:%.4f,growthRate:%.4f%%,canBuy:%s,canRedeem:%s]'
                % (self.serial, self.date.strftime("%Y-%m-%d"), self.code, self.name,
                   float(self.equity), float(self.accumulationEquity), float(self.increment),
                   float(self.growthRate), self.canBuy, self.canRedeem))
上述代码中FundEquity类的属性值使用getter/setter函数方式定义的,这种方式可以对值进行检查。__str__(self)函数类似其它语言里的toString()。
在命令行运行fund_spider.py代码,console窗口会输出净值数据。
小结
从以上的示例代码中可见少量代码就能把豆瓣网上小组中的帖子和回复数据抓取、内容解析、存储下来,可见python语言的简洁、高效。
例子的代码比较简单,唯一比较花时间的是调 xpath规则,借助于浏览器辅助插件工具能大大提高效率。
例子中没有提及pipeline(管道)、middleware(中间件) 这些复杂东西。没有考虑爬虫请求太频繁导致站方封禁ip(可以通过不断更换http proxy 方式破解),没有考虑需要登录才能抓取数据的情况(代码模拟用户登录破解)。
实际项目中提取内容的xpath规则、正则表达式 这类易变动的部分不应该硬编码写在代码里,网页抓取、内容解析、解析结果的存储等应该使用分布式架构的方式独立运行。总之实际生产环境中运行的爬虫系统需要考虑的问题很多,github上也有一些开源的网络爬虫系统,可以参考。