
The NBA dataset for decision-tree prediction in "Learning Data Mining with Python" (《python数据挖掘入门与实践》)

程序员文章站 2024-02-11 16:36:34

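Preface: When I reached the chapter on predicting team wins and losses with a decision tree, I tried to download the dataset from the URL given in the book, but the download never succeeded, and the Excel file I did manage to get was corrupted. But hey, I know Python, so fine, I'll just scrape the data myself. (The resulting files are linked at the end.)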

Code:

'''
    Scrape the NBA game results referenced in 《python数据挖掘入门与实践》
    https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
    Usage: run this .py file, then call save()
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = 'https://www.basketball-reference.com/leagues/NBA_2014_games-{month}.html'
# A plain list is enough here; the months are only iterated over
all_month = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']

def get_content():
    games = []  # avoid naming this `list`, which would shadow the built-in
    for month in all_month:
        url = BASE_URL.format(month=month)
        print(url)
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, 'lxml')
        # select() takes a CSS selector, so nested tags can be matched in one call
        rows = bsObj.select('tbody tr')
        for row in rows:
            # within each <tr>, collect the text of its <td> cells
            cell = [td.text for td in row.find_all('td')]
            games.append(cell)
    return games  # a two-dimensional list: one inner list per table row
# Save as CSV
def save():
    # Change this path to suit your machine
    path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv'
    data = get_content()
    df_data = pd.DataFrame(columns=[1, 2, 3, 4, 5, 6, 7, 8, 9], data=data)
    # Passing the path lets pandas open and close the file itself
    df_data.to_csv(path)
    print('done')

Output:

>>> save()
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
https://www.basketball-reference.com/leagues/NBA_2014_games-november.html
https://www.basketball-reference.com/leagues/NBA_2014_games-december.html
https://www.basketball-reference.com/leagues/NBA_2014_games-january.html
https://www.basketball-reference.com/leagues/NBA_2014_games-february.html
https://www.basketball-reference.com/leagues/NBA_2014_games-march.html
https://www.basketball-reference.com/leagues/NBA_2014_games-april.html
https://www.basketball-reference.com/leagues/NBA_2014_games-may.html
https://www.basketball-reference.com/leagues/NBA_2014_games-june.html
done

Data preview:
[Image: preview of the scraped game data]
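
The row extraction used in get_content() can be tried offline before hitting the real site. The following is only a sketch: SAMPLE_HTML is a made-up miniature table (team names and scores are placeholders, not real results), and it assumes that a header row repeated inside tbody contains only th cells. It uses the built-in 'html.parser' backend instead of 'lxml' so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Made-up miniature of a schedule table (placeholder teams and scores).
# Assumption: the date sits in a <th>, and a header row is repeated in <tbody>.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Day 1</th><td>Team A</td><td>87</td><td>Team B</td><td>97</td></tr>
    <tr class="thead"><th>Date</th><th>Visitor</th><th>PTS</th><th>Home</th><th>PTS</th></tr>
    <tr><th>Day 1</th><td>Team C</td><td>95</td><td>Team D</td><td>107</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = []
for tr in soup.select('tbody tr'):
    # Only <td> cells are collected, so <th> cells (the date here) are skipped
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # a row made only of <th> cells yields an empty list; drop it
        rows.append(cells)

print(rows)
```

If the date column is wanted as well, tr.find_all(['th', 'td']) collects both kinds of cell, at the cost of also picking up the repeated header row.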

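Addendum: Reading further, I found that a second dataset is needed, but the code above cannot be reused for it. The reason is that the team standings table is commented out in the HTML (you can see this in the page source), so find_all() never sees its rows. Here I use a regular expression to pull the table out of the comment instead.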

Code:

'''
    #get_standing_data.py
    Fetch the team standings used for the decision-tree NBA prediction in 《python数据挖掘入门与实践》
    Change the output path to suit your machine
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

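# A regex for an HTML comment would be: <!--[\s\S]*?-->
# pattern = re.compile(r'<!--[\s\S]*?-->')
# The same non-greedy idea, applied to the first <tbody>...</tbody> block
# (raw string, so that \s is not treated as a string escape)
pattern = re.compile(r'<tbody>[\s\S]*?</tbody>')
url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
html = urlopen(url).read()
bsObj = BeautifulSoup(html, 'lxml')
# prettify() returns the div as text, including the commented-out table
content = bsObj.find(id='all_expanded_standings').prettify()
match = re.search(pattern, content)
str_tbody = match.group()
html_tbody = BeautifulSoup(str_tbody, 'lxml')  # parse the extracted string back into a soup object
standings = []  # avoid naming this `list`, which would shadow the built-in
for tr in html_tbody.find_all('tr'):
    rows = [td.text for td in tr.find_all('td')]
    standings.append(rows)

# Save as CSV
file = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\standing.csv'  # change to suit your machine
df_data = pd.DataFrame(data=standings)
df_data.to_csv(file)
print('done')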
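
The comment trick can also be checked offline. This is a minimal sketch, not the script above: SAMPLE_HTML is a made-up fragment with placeholder teams, and the ROW and CELL patterns are a simplification swapped in so the example needs only the standard library (the script above hands the extracted tbody to BeautifulSoup instead).

```python
import re

# Made-up page fragment: the table lives inside an HTML comment
SAMPLE_HTML = """
<div id="all_expanded_standings">
<!--
<table>
  <tbody>
    <tr><td>Team A</td><td>10-2</td></tr>
    <tr><td>Team B</td><td>7-5</td></tr>
  </tbody>
</table>
-->
</div>
"""

# Non-greedy patterns: [\s\S] matches any character, newlines included
TBODY = re.compile(r'<tbody>[\s\S]*?</tbody>')
ROW = re.compile(r'<tr>([\s\S]*?)</tr>')    # simplification: no tag attributes
CELL = re.compile(r'<td>([\s\S]*?)</td>')   # simplification: no nested tags

standings = []
match = TBODY.search(SAMPLE_HTML)
if match is not None:  # guard against a page layout without a <tbody>
    for row_html in ROW.findall(match.group()):
        standings.append(CELL.findall(row_html))

print(standings)
```

The regex does not care that the markup sits between `<!--` and `-->`; it simply scans the raw text, which is why it finds rows that a tag-based search skips.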



Partial data preview:

>>> df_data
                        0      1      2      3      4      5     6     7   \
0               Miami Heat  66-16   37-4  29-12  41-11   25-5  14-4  12-6   
1    Oklahoma City Thunder  60-22   34-7  26-15   21-9  39-13   7-3   8-2   
2        San Antonio Spurs  58-24   35-6  23-18   25-5  33-19   8-2   9-1   
3           Denver Nuggets  57-25   38-3  19-22  19-11  38-14   5-5  10-0   
4     Los Angeles Clippers  56-26   32-9  24-17   21-9  35-17   7-3   8-2   
5        Memphis Grizzlies  56-26   32-9  24-17   22-8  34-18   8-2   8-2   
6          New York Knicks  54-28  31-10  23-18  37-15  17-13  10-6  12-6   
7            Brooklyn Nets  49-33  26-15  23-18  36-16  13-17  11-5  13-5   
8           Indiana Pacers  49-32  30-11  19-21  31-20  18-12  6-11  13-3   
9    Golden State Warriors  47-35  28-13  19-22  19-11  28-24   7-3   5-5   
10           Chicago Bulls  45-37  24-17  21-20  34-18  11-19  13-5   9-7   
11         Houston Rockets  45-37  29-12  16-25   21-9  24-28   7-3   7-3   
12      Los Angeles Lakers  45-37  29-12  16-25  17-13  28-24   6-4   6-4   
13           Atlanta Hawks  44-38  25-16  19-22  29-23  15-15  7-11  11-7   
14               Utah Jazz  43-39  30-11  13-28  17-13  26-26   5-5   5-5   
15          Boston Celtics  41-40  27-13  14-27  27-24  14-16   7-9   8-9   
16        Dallas Mavericks  41-41  24-17  17-24  17-13  24-28   5-5   6-4   
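
The records in this preview are stored as strings such as 66-16. As a rough sketch of one way to make them usable as numbers later on, the hypothetical helpers parse_record and win_pct below (neither is from the book or the script above) split a record into integers and compute a win percentage.

```python
def parse_record(record):
    """Split a 'wins-losses' string such as '66-16' into two integers."""
    wins, losses = record.split('-')
    return int(wins), int(losses)


def win_pct(record):
    """Fraction of games won for a 'wins-losses' string."""
    wins, losses = parse_record(record)
    return wins / (wins + losses)


# Two overall records copied from column 1 of the preview
overall = {'Miami Heat': '66-16', 'Oklahoma City Thunder': '60-22'}
parsed = {team: parse_record(rec) for team, rec in overall.items()}
print(parsed)
```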

File resources: if they are useful to you, a like would be appreciated.

Link: https://pan.baidu.com/s/1eUfa914 Password: 5ptu
