Python Web Crawler Basics

程序员文章站 2022-06-26 20:02:38


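Notes:
I. Basic steps of a simple crawler

1. Before crawling:
    (1) Define the goal
    (2) Find the web page that holds the data
    (3) Analyze the page structure and locate the data within it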

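2. Step two: the __fetch_content method
    Simulate an HTTP request: send it to the server and read back the HTML the server returns.
    The data we want is then extracted from that HTML with regular expressions.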
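
The fetch step can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the helper name fetch_html, its timeout parameter, and the fallback to UTF-8 are all made up for illustration. The usage line reads a data: URL in place of a real page so the example runs without network access.

```python
from urllib import request


def fetch_html(url, timeout=10):
    """Return the body at `url` as text (hypothetical helper)."""
    with request.urlopen(url, timeout=timeout) as response:
        # Use the charset the response declares; assume UTF-8 if it declares none.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# A data: URL stands in for a real page so the sketch runs offline.
html = fetch_html('data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E')
print(html)
```

Reading the charset from the response headers avoids hard-coding an encoding, which matters once the crawler is pointed at pages that are not UTF-8.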

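3. Step three: the __analysis method
    (1) Find an anchoring tag or identifier and use a regular expression to capture the content it encloses.
    The rules for choosing it are:
    uniqueness, proximity to the data, and preferring an enclosing (parent) tag that is properly closed.
    (2) Extract the needed fields from the captured content; this may take several passes.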
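
The two-pass extraction described above can be tried on a small piece of markup. The HTML below is hypothetical, modeled on the class names that appear in the code section, and extract_anchors is an illustrative name. The pattern [\s\S]*? matches any run of characters, newlines included, and the trailing ? makes it non-greedy so each match stops at the first closing tag.

```python
import re

# Hypothetical markup, for illustration only.
SAMPLE_HTML = '''
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        anchor_a</span>
    <span class="video-number">1.2万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        anchor_b</span>
    <span class="video-number">3456</span>
</div>
'''

ROOT_PATTERN = r'<div class="video-info">([\s\S]*?)</div>'
NAME_PATTERN = r'</i>([\s\S]*?)</span>'
NUMBER_PATTERN = r'<span class="video-number">([\s\S]*?)</span>'


def extract_anchors(html):
    """First pass: isolate each block. Second pass: pull fields out of it."""
    anchors = []
    for block in re.findall(ROOT_PATTERN, html):
        anchors.append({
            'name': re.findall(NAME_PATTERN, block),
            'number': re.findall(NUMBER_PATTERN, block),
        })
    return anchors


anchors = extract_anchors(SAMPLE_HTML)
print(anchors[0]['number'])
```

At this stage the names still carry the surrounding whitespace and every field is a one-element list; cleaning that up is the job of the refine step.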

4. Refine the extracted data
    Use a lambda expression (with map) in place of a for loop
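
To make the swap concrete, here is the same refinement written both ways. The input data is invented for illustration; it has the shape the analysis step produces (one-element lists with stray whitespace).

```python
# Invented sample input: one-element lists that still carry whitespace.
raw_anchors = [
    {'name': ['\n        anchor_a'], 'number': [' 1.2万 ']},
    {'name': ['\n        anchor_b'], 'number': ['3456\n']},
]

# Version 1: an explicit for loop.
refined_loop = []
for anchor in raw_anchors:
    refined_loop.append({
        'name': anchor['name'][0].strip(),
        'number': anchor['number'][0].strip(),
    })

# Version 2: map with a lambda. map() is lazy, so list() materializes it.
refine = lambda anchor: {
    'name': anchor['name'][0].strip(),
    'number': anchor['number'][0].strip(),
}
refined = list(map(refine, raw_anchors))

print(refined)
```

Both versions produce the same list. Note that map returns an iterator that can be consumed only once, so wrap it in list() if the result is needed more than once.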

5. Process the refined data (here, sort it)

6. Display the processed data
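
Steps 5 and 6 can be sketched as a sort by viewer count followed by a ranked printout. The sample data and the helper name parse_count are assumptions for illustration. The helper treats 万 as a multiplier of 10,000 and returns 0.0 when the text contains no digits.

```python
import re


def parse_count(text):
    """Convert text such as '3456' or '1.2万' to a number (hypothetical helper)."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0
    number = float(match.group())
    if '万' in text:          # 万 means ten thousand
        number *= 10000
    return number


# Invented sample data in the shape produced by the refine step.
anchors = [
    {'name': 'anchor_b', 'number': '3456'},
    {'name': 'anchor_a', 'number': '1.2万'},
    {'name': 'anchor_c', 'number': '980'},
]

# Step 5: sort in descending order of viewer count.
ranked = sorted(anchors, key=lambda a: parse_count(a['number']), reverse=True)

# Step 6: display with a 1-based rank.
for rank, anchor in enumerate(ranked, start=1):
    print('rank' + str(rank) + ' : ' + anchor['name'] + '   ' + anchor['number'])
```

Using enumerate with start=1 avoids indexing the list by position when printing the rank.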

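II. Coding conventions
    1. Write comments
    2. Use blank lines to separate logical blocks
    3. Keep functions to about 10-20 lines
    4. Write flat, same-level methods and call them from one main method; avoid deeply nested calls!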

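IV. Further topics
    Beautiful Soup and the Scrapy crawling framework
    Crawlers, anti-crawler measures, and countermeasures against them
    If your IP gets banned, use proxy IPs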
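
As a rough sketch of the proxy idea, urllib can route requests through a proxy by installing a ProxyHandler in an opener. The proxy address below is a placeholder, not a working proxy. The example only builds the opener and sends no request, so it runs without one.

```python
from urllib import request

# Placeholder proxy address (an assumption); substitute one you are allowed to use.
PROXY_URL = 'http://127.0.0.1:8080'

proxy_handler = request.ProxyHandler({'http': PROXY_URL, 'https': PROXY_URL})
opener = request.build_opener(proxy_handler)

# opener.open(url) would now send HTTP and HTTPS requests through the proxy.
# Nothing is requested here, so no live proxy is needed to run this sketch.
print(type(opener).__name__)
```

Calling opener.open(url) instead of request.urlopen(url) keeps the proxy setting local to this opener rather than changing the process-wide default.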

V. Summary
    (1) Practice regular expressions more
    (2) Practice lambda expressions more!
    (3) Train an object-oriented way of thinking

Code:

"""
This module is used to crawl data!
"""

import re
from urllib import request
# Tip: learn breakpoint debugging as a replacement for print; it is especially important!


class Spider:
    """
    This class is used to crawl data!
    """
    url = 'https://www.panda.tv/cate/hearthstone'
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'     # non-greedy match
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        """
        Request the page and return its HTML as a string.
        """
        r = request.urlopen(self.url)   # fetch the HTML
        html_s = r.read()
        html = str(html_s, encoding='utf-8')

        return html

    def __analysis(self, html):
        root_html = re.findall(self.root_pattern, html)     # list
        # print(root_html[0])   # result of the first match

        anchors = []
        for block in root_html:
            name = re.findall(self.name_pattern, block)
            number = re.findall(self.number_pattern, block)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        # print(anchors[0])

        return anchors

    @staticmethod
    def __refine(anchors):
        i = lambda anchor: {'name': anchor['name'][0].strip(),  # each list holds a single element
                            'number': anchor['number'][0].strip()
                            }
        return map(i, anchors)

    def __sort(self, anchors):      # business logic
        anchors = sorted(anchors, key=self.__sort_seek, reverse=True)
        return anchors

    @staticmethod
    def __sort_seek(anchor):
        # Match the whole number, decimals included (e.g. '1.2' in '1.2万')
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0])
        if '万' in anchor['number']:     # 万 means ten thousand
            number *= 10000

        return number

    @staticmethod
    def __show(anchors):
        # for anchor in anchors:
            # print(anchor['name'] + '-----' + anchor['number'])
        for rank in range(0, len(anchors)):
            print('rank' + str(rank + 1)
                  + ' : ' + anchors[rank]['name']
                  + '   ' + anchors[rank]['number'])

    def go(self):                           # main method (calls the flat, same-level methods)
        html = self.__fetch_content()       # fetch the text
        anchors = self.__analysis(html)     # parse the data
        anchors = self.__refine(anchors)    # refine the data
        # print(list(anchors))
        anchor = self.__sort(anchors)
        self.__show(anchor)


spider = Spider()
spider.go()
