"Learning Data Mining with Python": scraping the NBA dataset for the decision-tree chapter
Preface: When I reached the chapter where a decision tree predicts which team wins, I could not download the dataset from the URL given in the book no matter what I tried; even when an Excel file did come down, it was corrupted. Well, I have learned Python, so I scraped the data myself. (Download links are at the end of the post.)
Code:
'''
Scrape the NBA game results used in "Learning Data Mining with Python"
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
Usage: run the .py file, then call save()
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

BASE_URL = 'https://www.basketball-reference.com/leagues/NBA_2014_games-{month}.html'
all_month = np.array(['october', 'november', 'december', 'january',
                      'february', 'march', 'april', 'may', 'june'])

def get_content():
    records = []
    for month in all_month:
        url = BASE_URL.format(month=month)
        print(url)
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, 'lxml')
        rows = bsObj.select('tbody tr')  # select() accepts nested CSS selectors
        for row in rows:
            cell = [td.text for td in row.find_all('td')]  # within each tr, pick out its td cells
            records.append(cell)
    return records  # a two-dimensional list

# save as CSV
def save():
    path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv'  # change to your own path
    records = get_content()
    df_data = pd.DataFrame(columns=[1, 2, 3, 4, 5, 6, 7, 8, 9], data=records)
    df_data.to_csv(path)
    print('done')
Output:
>>> save()
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
https://www.basketball-reference.com/leagues/NBA_2014_games-november.html
https://www.basketball-reference.com/leagues/NBA_2014_games-december.html
https://www.basketball-reference.com/leagues/NBA_2014_games-january.html
https://www.basketball-reference.com/leagues/NBA_2014_games-february.html
https://www.basketball-reference.com/leagues/NBA_2014_games-march.html
https://www.basketball-reference.com/leagues/NBA_2014_games-april.html
https://www.basketball-reference.com/leagues/NBA_2014_games-may.html
https://www.basketball-reference.com/leagues/NBA_2014_games-june.html
done
Data preview:
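To preview the result and confirm that all nine months were captured, the file can simply be read back with pandas. A minimal sketch, assuming the same path that save() used above:

import pandas as pd

matches = pd.read_csv('D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv',
                      index_col=0)
print(matches.shape)   # one row per scraped table row (2013-14 regular season plus playoffs)
print(matches.head())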
Supplement: Further into the chapter a second dataset is needed, the team standings, but the code above cannot be reused for it. The reason is that the standings table is wrapped in an HTML comment (you can see this in the page source), so BeautifulSoup does not parse it as part of the DOM. A regular expression is therefore used to pull the table out of the comment.
Code:
'''
get_standing_data.py
Fetch the team standings needed by the decision-tree NBA example in
"Learning Data Mining with Python"; change the output path to your own.
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

# pattern = re.compile(r'<!--[\s\S]*?-->')  # regex for an HTML comment: <!--[\s\S]*?-->
pattern = re.compile(r'<tbody>[\s\S]*?</tbody>')  # grab the <tbody> hidden inside the comment

url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
html = urlopen(url).read()
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find(id='all_expanded_standings').prettify()  # raw markup of the wrapper div, comment included
match = re.search(pattern, content)
str_tbody = match.group()
html_tbody = BeautifulSoup(str_tbody, 'lxml')  # parse the extracted string back into an HTML object

records = []
for tr in html_tbody.find_all('tr'):
    rows = [td.text for td in tr.find_all('td')]
    records.append(rows)

# write out as CSV
path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\standing.csv'  # change to your own path
df_data = pd.DataFrame(data=records)
df_data.to_csv(path)
print('done')
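A note on the design choice: the regex works because prettify() returns the raw markup of the wrapper div, HTML comment included. An alternative is to let BeautifulSoup hand back the comment node itself via bs4.Comment and re-parse its text, which avoids the regex entirely. A minimal sketch, assuming the table is still commented out inside the div with id='all_expanded_standings':

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
soup = BeautifulSoup(urlopen(url).read(), 'lxml')
wrapper = soup.find(id='all_expanded_standings')

# pull out the comment node, then parse the commented-out markup as HTML
comment = wrapper.find(string=lambda text: isinstance(text, Comment))
table_soup = BeautifulSoup(comment, 'lxml')
rows = [[td.text for td in tr.find_all('td')]
        for tr in table_soup.find_all('tr')]
print(len(rows))  # number of standings rows extracted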
Partial data preview:
>>> df_data
                        0      1      2      3      4      5     6     7  \
0              Miami Heat  66-16   37-4  29-12  41-11   25-5  14-4  12-6
1   Oklahoma City Thunder  60-22   34-7  26-15   21-9  39-13   7-3   8-2
2       San Antonio Spurs  58-24   35-6  23-18   25-5  33-19   8-2   9-1
3          Denver Nuggets  57-25   38-3  19-22  19-11  38-14   5-5  10-0
4    Los Angeles Clippers  56-26   32-9  24-17   21-9  35-17   7-3   8-2
5       Memphis Grizzlies  56-26   32-9  24-17   22-8  34-18   8-2   8-2
6         New York Knicks  54-28  31-10  23-18  37-15  17-13  10-6  12-6
7           Brooklyn Nets  49-33  26-15  23-18  36-16  13-17  11-5  13-5
8          Indiana Pacers  49-32  30-11  19-21  31-20  18-12  6-11  13-3
9   Golden State Warriors  47-35  28-13  19-22  19-11  28-24   7-3   5-5
10          Chicago Bulls  45-37  24-17  21-20  34-18  11-19  13-5   9-7
11        Houston Rockets  45-37  29-12  16-25   21-9  24-28   7-3   7-3
12     Los Angeles Lakers  45-37  29-12  16-25  17-13  28-24   6-4   6-4
13          Atlanta Hawks  44-38  25-16  19-22  29-23  15-15  7-11  11-7
14              Utah Jazz  43-39  30-11  13-28  17-13  26-26   5-5   5-5
15         Boston Celtics  41-40  27-13  14-27  27-24  14-16   7-9   8-9
16       Dallas Mavericks  41-41  24-17  17-24  17-13  24-28   5-5   6-4
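With matches.csv and standing.csv in hand, the chapter's decision-tree step can begin. As a pointer, a minimal sketch for deriving the "home team wins" class label from the scraped scores; the column positions used here ('3' for visitor points, '5' for home points) are an assumption about how the nine td cells line up in matches.csv, so check your file and adjust them if needed:

import pandas as pd

matches = pd.read_csv('D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv',
                      index_col=0)
matches = matches.dropna(how='all')  # drop any empty separator rows picked up from the table
visitor_pts = pd.to_numeric(matches['3'], errors='coerce')
home_pts = pd.to_numeric(matches['5'], errors='coerce')
matches['HomeWin'] = home_pts > visitor_pts  # the label the decision tree will predict
print(matches['HomeWin'].mean())  # fraction of games the home team won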
File resources (a like would be appreciated if this helped):
链接:https://pan.baidu.com/s/1eUfa914 密码:5ptu