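Processing Tabular Data with Python
Technical Background
Data processing is a very active field right now: by modeling data collected from large real-world scenarios, we can try to predict what is likely to happen next. For example, suppose we have gold price data covering 2002 to 2018.
The data comes from an open-source project on gitee. It contains the following fields: date, opening price, closing price, highest price, lowest price, trading volume, and turnover. If we analyze this data with a machine learning model, we may be able to predict gold prices that do not appear in the data set, and if the predictions fit well, that matters a great deal for some people's investment strategies. Real-world data of this kind, however, is often very large. The file used here is only a little over 300 KB, but far more often we have to deal with 10 GB or even more than 1 TB of data. If we cannot even process the data, how are we supposed to model it?
Handling Excel Spreadsheets in Python
Let us start with the simplest case and ignore performance for now. We can use the xlrd package to open and load an Excel spreadsheet in Python: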
    # table.py
    import xlrd


    def read_table_by_xlrd():
        # Open the workbook and list the names of all its sheets.
        workbook = xlrd.open_workbook(r'data.xls')
        sheet_name = workbook.sheet_names()
        print('all sheets in the file data.xls are: {}'.format(sheet_name))
        # Take the first sheet and read one cell, one row and one column.
        sheet = workbook.sheet_by_index(0)
        print('the cell value of row index 0 and col index 1 is: {}'.format(sheet.cell_value(0, 1)))
        print('the elements of row index 0 are: {}'.format(sheet.row_values(0)))
        print('the length of col index 1 are: {}'.format(len(sheet.col_values(1))))


    if __name__ == '__main__':
        read_table_by_xlrd()
The output of the code above is:
    [dechin@dechin-manjaro gold]$ python3 table.py
    all sheets in the file data.xls are: ['sheet1', 'sheet2', 'sheet3']
    the cell value of row index 0 and col index 1 is: 开
    the elements of row index 0 are: ['时间', '开', '高', '低', '收', '量', '额']
    the length of col index 1 are: 3923
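We have now loaded an xls spreadsheet into memory in Python and can analyze the data. If the data needs to be modified, the openpyxl library can be used (it targets the newer xlsx format), but we will not go into detail here.
Python has another very popular and very powerful library for tabular data: pandas. Here we use the ipython tool to briefly show how to process tabular data with pandas: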
    [dechin@dechin-manjaro gold]$ ipython
    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import pandas as pd

    In [2]: !ls -l
    total 368
    -rw-r--r-- 1 dechin dechin 372736 Mar 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563 Mar 27 21:42 table.py

    In [3]: data = pd.read_excel('data.xls', 'sheet1')  # read the Excel file

    In [4]: data.to_csv('data.csv', encoding='utf-8')  # convert it to a CSV file

    In [7]: !ls -l
    total 588
    -rw-r--r-- 1 dechin dechin 221872 Mar 27 21:52 data.csv
    -rw-r--r-- 1 dechin dechin 372736 Mar 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563 Mar 27 21:42 table.py

    In [8]: !head -n 10 data.csv  # show the first 10 lines of the CSV file
    ,时间,开,高,低,收,量,额
    0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370
    1,2002-10-31,83.9,83.92,83.9,83.91,66,5537480
    2,2002-11-01,84.5,84.65,84.0,84.51,77,6502510
    3,2002-11-04,84.9,85.06,84.9,84.99,95,8076330
    4,2002-11-05,85.1,85.2,85.1,85.13,61,5193650
    5,2002-11-06,84.9,84.9,84.9,84.9,1,84900
    6,2002-11-07,85.0,85.15,85.0,85.14,26,2212310
    7,2002-11-08,85.25,85.28,85.1,85.16,35,2981780
    8,2002-11-11,85.18,85.19,85.18,85.19,65,5537050
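In ipython we can run not only Python statements but also system commands, simply by prefixing them with !, which is very convenient. A CSV file is just plain text in which commas, rather than the commonly used \t (tab) character, separate the fields, and newlines separate the rows.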
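To make the CSV layout concrete, here is a minimal sketch that parses the first lines of data.csv with Python's standard csv module. The sample text is copied from the head output above; the function name parse_csv is only an illustrative choice, not something from the original article.

```python
import csv
import io

# The first lines of data.csv, copied from the `head` output above.
SAMPLE = """\
,时间,开,高,低,收,量,额
0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370
1,2002-10-31,83.9,83.92,83.9,83.91,66,5537480
"""


def parse_csv(text):
    """Split CSV text into a header row and a list of data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first line holds the column names
    rows = list(reader)    # every remaining line is one record
    return header, rows


header, rows = parse_csv(SAMPLE)
print(header)
print(rows[0])
```

Every field comes back as a string, and the leading comma in the header yields an empty first column name, which is the row index column that pandas wrote.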
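However, whether we use xlrd or pandas, we face the same problem: all of the data must be loaded into memory before it can be processed. A typical personal computer has only 8 GB to 16 GB of memory, and even with a comparatively large 64 GB we can only process files that need less than 64 GB of memory, which is far from enough for big-data scenarios. vaex, introduced in the next section, is a good solution to this. As an aside, local memory and its usage can be inspected on Linux as follows: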
    [dechin@dechin-manjaro gold]$ vmstat
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 35812168 328340 2904872    0    0    20    27  362  365  8  4 88  0  0
    [dechin@dechin-manjaro gold]$ vmstat 2 3
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
     1  0      0 35810916 328356 2905844    0    0    20    27  362  365  8  4 88  0  0
     0  0      0 35811916 328364 2904952    0    0     0     6  613  688  1  1 99  0  0
     0  0      0 35812168 328364 2904856    0    0     0     0  672  642  0  1 99  0  0
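We can see that about 36 GB of memory is free (vmstat reports these values in KiB by default). This machine has 40 GB of memory in total, which is fairly large.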
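The same information can also be read from Python. The following is a minimal sketch that parses text in the /proc/meminfo layout used on Linux. The helper names are illustrative, and the sample text is hypothetical: only the MemFree, Buffers and Cached numbers are taken from the vmstat output above, while the MemTotal value is a made-up placeholder.

```python
def parse_meminfo(text):
    """Map each field of /proc/meminfo-style text to its numeric value (kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])  # first token is the number
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the real file; this only works on Linux."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Hypothetical sample; MemTotal is a placeholder, the rest mirrors vmstat above.
SAMPLE = """\
MemTotal:       40000000 kB
MemFree:        35812168 kB
Buffers:          328340 kB
Cached:          2904872 kB
"""

info = parse_meminfo(SAMPLE)
print("free: {:.1f} GiB".format(info["MemFree"] / 1024 ** 2))
```

On a Linux machine, calling read_meminfo() instead of parsing SAMPLE gives the live values.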
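Installing and Using vaex
vaex provides a data processing approach based on memory mapping: instead of loading the whole data file into memory, we operate directly on the data stored on disk. In other words, the size of the files we can handle is no longer limited by the amount of memory; as long as a file fits in the available disk space, we can process it.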
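The idea of memory mapping can be illustrated without vaex. The sketch below uses Python's standard mmap module on a small binary file of float64 values. It is not how vaex is implemented internally; the file name prices.bin and both helper functions are assumptions made only for this illustration.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian float64 per record


def write_values(path, values):
    """Write floats to a flat binary file, one 8-byte record each."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_value(path, index):
    """Read a single record through a memory map instead of f.read()."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this record need to be brought into memory.
            return RECORD.unpack_from(mm, index * RECORD.size)[0]


# Hypothetical demo file holding the first five opening prices.
path = os.path.join(tempfile.mkdtemp(), "prices.bin")
write_values(path, [83.98, 83.9, 84.5, 84.9, 85.1])
print(read_value(path, 2))
```

Because the operating system pages data in on demand, the cost of reading one record does not grow with the size of the file, which is the property that lets a memory-mapped tool work on files larger than RAM.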
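Nowadays even the smallest personal PC disks are around 128 GB, far more than memory can hold. Of course, because of partitioning, there is no guarantee that all of the disk space can actually be used. The following command shows the free disk space on the partition that holds the current directory: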
    [dechin@dechin-manjaro gold]$ df -hl .
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme0n1p9  144G   57G   80G  42% /
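Here we still have 80 GB of free disk space. If we put an 80 GB table file in the current directory, neither pandas nor xlrd could process it, because it far exceeds what memory can hold. With vaex, however, we can still process the file.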
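The free-space check can also be done from Python with the standard library's shutil.disk_usage. This is a minimal sketch; fits_on_disk is a hypothetical helper name introduced here for illustration.

```python
import shutil


def fits_on_disk(size_bytes, path="."):
    """Return True if a file of size_bytes fits in the free space at path."""
    return size_bytes <= shutil.disk_usage(path).free


usage = shutil.disk_usage(".")
print("total: {:.0f} GB, used: {:.0f} GB, free: {:.0f} GB".format(
    usage.total / 1e9, usage.used / 1e9, usage.free / 1e9))
```

The numbers are reported for the partition that contains the given path, which matches what df -hl . shows.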
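The vaex project also describes its principles and advantages.
Installing vaex
As with most third-party Python packages, we can use pip to download and manage it. Quite a few files need to be downloaded, so the process can be slow; we just have to wait patiently: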
    [dechin@dechin-manjaro gold]$ python3 -m pip install vaex
    Collecting vaex
      Downloading vaex-4.1.0-py3-none-any.whl (4.5 kB)
    Collecting vaex-ml<0.12,>=0.11.0
      Downloading vaex_ml-0.11.1-py3-none-any.whl (95 kB)
    Collecting vaex-core<5,>=4.1.0
      Downloading vaex_core-4.1.0-cp38-cp38-manylinux2010_x86_64.whl (2.5 MB)
    Collecting vaex-viz<0.6,>=0.5.0
      Downloading vaex_viz-0.5.0-py3-none-any.whl (19 kB)
    Collecting vaex-astro<0.9,>=0.8.0
      Downloading vaex_astro-0.8.0-py3-none-any.whl (20 kB)
    Collecting vaex-hdf5<0.8,>=0.7.0
      Downloading vaex_hdf5-0.7.0-py3-none-any.whl (15 kB)
    Collecting vaex-server<0.5,>=0.4.0
      Downloading vaex_server-0.4.0-py3-none-any.whl (13 kB)
    Collecting vaex-jupyter<0.7,>=0.6.0
      Downloading vaex_jupyter-0.6.0-py3-none-any.whl (42 kB)
    ...
    Collecting pyarrow>=3.0
      Downloading pyarrow-3.0.0-cp38-cp38-manylinux2014_x86_64.whl (20.7 MB)
    ...
    Building wheels for collected packages: frozendict, aplus
    ...
    Successfully built frozendict aplus
    Installing collected packages: pyarrow, tabulate, frozendict, aplus, python-utils, progressbar2, vaex-core, vaex-ml, vaex-viz, vaex-astro, vaex-hdf5, cachetools, vaex-server, xarray, jupyterlab-widgets, ipywidgets, ipympl, branca, shapely, traittypes, ipyleaflet, ipyvue, ipyvuetify, ipywebrtc, ipydatawidgets, pythreejs, ipyvolume, bqplot, vaex-jupyter, vaex
      Attempting uninstall: ipywidgets
        Found existing installation: ipywidgets 7.5.1
        Uninstalling ipywidgets-7.5.1:
          Successfully uninstalled ipywidgets-7.5.1
    Successfully installed aplus-0.11.0 bqplot-0.12.23 branca-0.4.2 cachetools-4.2.1 frozendict-1.2 ipydatawidgets-4.2.0 ipyleaflet-0.13.6 ipympl-0.7.0 ipyvolume-0.5.2 ipyvue-1.5.0 ipyvuetify-1.6.2 ipywebrtc-0.5.0 ipywidgets-7.6.3 jupyterlab-widgets-1.0.0 progressbar2-3.53.1 pyarrow-3.0.0 python-utils-2.5.6 pythreejs-2.3.0 shapely-1.7.1 tabulate-0.8.9 traittypes-0.2.1 vaex-4.1.0 vaex-astro-0.8.0 vaex-core-4.1.0 vaex-hdf5-0.7.0 vaex-jupyter-0.6.0 vaex-ml-0.11.1 vaex-server-0.4.0 vaex-viz-0.5.0 xarray-0.17.0
Once the words Successfully installed appear, the installation has succeeded and we can start using vaex.
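Besides reading the pip log, the installation can be checked from Python itself. The sketch below uses the standard importlib machinery to test whether a top-level module can be found; is_installed is a hypothetical helper name, and the check only confirms that the module is discoverable, not that it works correctly.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report on the three libraries used in this article.
for name in ("xlrd", "pandas", "vaex"):
    print(name, "installed" if is_installed(name) else "missing")
```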
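Performance Comparison
Since other tools can also open and read table files without trouble, we use ipython to directly compare how long xlrd and vaex take to open the data, in order to show the advantage of vaex: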
    [dechin@dechin-manjaro gold]$ ipython
    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import vaex

    In [2]: import xlrd

    In [3]: %timeit xlrd.open_workbook(r'data.xls')
    46.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [4]: %timeit vaex.open('data.csv')
    4.95 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [7]: %timeit vaex.open('data.hdf5')
    1.34 ms ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
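The results show that opening the same data takes xlrd nearly 50 ms, while vaex needs as little as about 1 ms. Note that the three calls read different file formats (xls, csv and hdf5), and that vaex is designed to memory-map hdf5 files rather than parse them up front, which is a large part of why opening is so cheap. A performance gap this large means vaex deserves more of our attention. Comparisons with other libraries have already been done by others: reportedly, even against pandas, vaex reads more than 1000 times faster, while its computation speedup is several-fold. Overall it performs very well.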
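Outside ipython, a similar measurement can be made with the standard timeit module. The sketch below mirrors the mean and standard deviation that %timeit reports. The names time_call and stand_in_load are illustrative, and stand_in_load is only a placeholder workload so that the example runs without any data file.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=10):
    """Return (mean, std. dev.) of the per-call time in milliseconds."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call_ms = [t / number * 1000.0 for t in totals]
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)


def stand_in_load():
    # Placeholder workload; swap in e.g. lambda: xlrd.open_workbook('data.xls').
    return sum(range(10_000))


mean_ms, std_ms = time_call(stand_in_load)
print("{:.3f} ms ± {:.3f} ms per call".format(mean_ms, std_ms))
```

To reproduce the comparison above, pass a callable that opens the file in question, for example lambda: vaex.open('data.hdf5'), in place of stand_in_load.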
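Converting Data Formats
The test in the previous section used a file we had not mentioned before, data.hdf5, which was in fact converted from data.csv. This section describes how to convert data into a format that vaex can open and recognize. The first option is to use pandas to convert the csv file directly to hdf5, much as we converted the xls file to csv in the section on handling Excel spreadsheets in Python: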
    [dechin@dechin-manjaro gold]$ ipython
    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import pandas as pd

    In [4]: data = pd.read_csv('data.csv')

    In [10]: data.to_hdf('data.hdf5', key='data', mode='w', format='table')

    In [11]: !ls -l
    total 932
    -rw-r--r-- 1 dechin dechin 221872 Mar 27 21:52 data.csv
    -rw-r--r-- 1 dechin dechin 348524 Mar 27 22:17 data.hdf5
    -rw-r--r-- 1 dechin dechin 372736 Mar 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563 Mar 27 21:42 table.py
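This creates an hdf5 file in the current directory. The drawback of this approach is that the resulting hdf5 file is not laid out in a way vaex can use directly. If we read it with df = vaex.open('data.hdf5'), the output looks like this: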
    In [3]: df
    Out[3]:
    #      table
    0      '(0, [83.98, 92.38, 82. , 83.52], [ 0, ...
    1      '(1, [83.9 , 83.92, 83.9 , 83.91], [ 1, ...
    2      '(2, [84.5 , 84.65, 84. , 84.51], [ 2, ...
    3      '(3, [84.9 , 85.06, 84.9 , 84.99], [ 3, ...
    4      '(4, [85.1 , 85.2 , 85.1 , 85.13], [ 4, ...
    ...    ...
    3,917  '(3917, [274.65, 275.35, 274.6 , 274.61], [ ...
    3,918  '(3918, [274.4, 275.2, 274.1, 275. ], [ 391...
    3,919  '(3919, [275. , 275.01, 274. , 274.19], [ ...
    3,920  '(3920, [275.2, 275.2, 272.6, 272.9], [ 392...
    3,921  '(3921, [272.96, 273.73, 272.5 , 272.93], [ ...
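This view loses the most important piece of information, the column names: although the values themselves are all preserved, they are very inconvenient to access. We therefore recommend the second conversion method, which uses vaex itself to convert the format: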
    [dechin@dechin-manjaro gold]$ ipython
    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import vaex

    In [2]: df = vaex.from_csv('data.csv')

    In [3]: df.export_hdf5('vaex_data.hdf5')

    In [4]: !ls -l
    total 1220
    -rw-r--r-- 1 dechin dechin 221856 Mar 27 22:34 data.csv
    -rw-r--r-- 1 dechin dechin 348436 Mar 27 22:34 data.hdf5
    -rw-r--r-- 1 dechin dechin 372736 Mar 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563 Mar 27 21:42 table.py
    -rw-r--r-- 1 dechin dechin 293512 Mar 27 22:52 vaex_data.hdf5
When this finishes, a vaex_data.hdf5 file has been created in the current directory. Let us try reading this new hdf5 file:
    [dechin@dechin-manjaro gold]$ ipython
    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import vaex

    In [2]: df = vaex.open('vaex_data.hdf5')

    In [3]: df
    Out[3]:
    #      i     t             s       h       l      e       n      a
    0      0     '2002-10-30'  83.98   92.38   82.0   83.52   352    29373370
    1      1     '2002-10-31'  83.9    83.92   83.9   83.91   66     5537480
    2      2     '2002-11-01'  84.5    84.65   84.0   84.51   77     6502510
    3      3     '2002-11-04'  84.9    85.06   84.9   84.99   95     8076330
    4      4     '2002-11-05'  85.1    85.2    85.1   85.13   61     5193650
    ...    ...   ...           ...     ...     ...    ...     ...    ...
    3,917  3917  '2018-11-23'  274.65  275.35  274.6  274.61  13478  3708580608
    3,918  3918  '2018-11-26'  274.4   275.2   274.1  275.0   13738  3773763584
    3,919  3919  '2018-11-27'  275.0   275.01  274.0  274.19  13984  3836845568
    3,920  3920  '2018-11-28'  275.2   275.2   272.6  272.9   15592  4258130688
    3,921  3921  '2018-11-28'  272.96  273.73  272.5  272.93  592    161576336

    In [4]: df.s
    Out[4]:
    Expression = s
    Length: 3,922 dtype: float64 (column)
    -------------------------------------
       0   83.98
       1    83.9
       2    84.5
       3    84.9
       4    85.1
        ...
    3917  274.65
    3918   274.4
    3919     275
    3920   275.2
    3921  272.96

    In [11]: df.plot(df.i, df.s, show=True)  # draw the plot
    /home/dechin/anaconda3/lib/python3.8/site-packages/vaex/viz/mpl.py:311: UserWarning: `plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.
      warnings.warn('`plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.')
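One thing to note: in the new hdf5 file the column names have changed from Chinese, such as 高 (high) and 低 (low), to English letters such as h and l. To make the data easier to work with, we manually edited the column names in the csv file to English before converting it to hdf5. Finally, we used vaex's built-in plotting to draw the gold price movement over this period of more than a decade.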
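vaex comes with relatively few built-in plotting methods, and the most commonly used one is the heatmap, so the gold price chart drawn here is rendered as a heatmap. Even so, the functionality is reasonably complete, and the performance is exceptionally strong.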
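The column renaming mentioned above was done by hand in the article, but it can also be scripted. The following sketch uses the standard csv module; the mapping is an assumption inferred from the column order of the two outputs shown earlier (the unnamed index column becomes i, 时间 becomes t, and so on), and rename_header is a hypothetical helper name.

```python
import csv
import io

# Assumed mapping, inferred from the column order of the outputs shown earlier.
RENAME = {"": "i", "时间": "t", "开": "s", "高": "h", "低": "l",
          "收": "e", "量": "n", "额": "a"}


def rename_header(src, dst, mapping):
    """Copy CSV data from src to dst, replacing names in the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow([mapping.get(name, name) for name in header])
    writer.writerows(reader)  # data rows pass through unchanged


# In-memory demo using the first two lines of data.csv.
src = io.StringIO(",时间,开,高,低,收,量,额\n"
                  "0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370\n")
dst = io.StringIO()
rename_header(src, dst, RENAME)
print(dst.getvalue())
```

For real files, the two StringIO objects would be replaced by files opened with newline="" and encoding="utf-8", and the rows would be streamed one at a time, so the whole table never has to sit in memory.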
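Summary
In this article we introduced three Python libraries for processing tabular data: xlrd, pandas and vaex, with particular emphasis on vaex's performance and its value for big-data applications. With a few simple examples we got a first impression of each library's characteristics, so that we can choose among them as appropriate in real-world scenarios.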