VBA vs. Python pandas for Data Processing: A Case Comparison

程序员文章站 2022-06-15 17:41:10

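Requirement:

An existing CSV file contains two columns, 'cnum' and 'company'. The data includes blank rows as well as rows with duplicate content.

Tasks:

1) Remove the blank rows;
2) Keep only one row for each set of duplicate rows;
3) Rename the 'company' column to 'company_new';
4) Append six columns after it: 'c_col', 'd_col', 'e_col', 'f_col', 'g_col', 'h_col'.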

1. Using Python pandas:

import pandas as pd

def deal_with_data(filepath, newpath):
    with open(filepath) as file_obj:
        df = pd.read_csv(file_obj)  # read the CSV file into a DataFrame
    # Re-specify the column index; the six new columns are created empty
    df = df.reindex(columns=['cnum', 'company', 'c_col', 'd_col', 'e_col', 'f_col', 'g_col', 'h_col'], fill_value=None)
    df.rename(columns={'company': 'company_new'}, inplace=True)  # rename the column
    df = df.dropna(axis=0, how='all')            # drop all-NaN rows, i.e. the blank rows in the file
    df['cnum'] = df['cnum'].astype('int32')      # cast the cnum column to int32
    df = df.drop_duplicates(subset=['cnum', 'company_new'], keep='first')  # drop duplicate rows
    df.to_csv(newpath, index=False, encoding='gbk')

if __name__ == '__main__':
    file_path = r'c:\users\12078\desktop\python\cnum_company.csv'
    file_save_path = r'c:\users\12078\desktop\python\cnum_company_output.csv'
    deal_with_data(file_path, file_save_path)
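The order of the steps matters: the blank rows are dropped before the cast to int32, because a column that still contains NaN cannot be converted to an integer type. The sketch below replays the same steps on a small in-memory sample so the effect can be checked without touching any files. The sample data and the names `SAMPLE_CSV`, `NEW_COLUMNS`, and `clean_frame` are invented for illustration and are not part of the original script.

```python
import io
import pandas as pd

# Hypothetical sample input, invented for illustration only.
SAMPLE_CSV = "cnum,company\n1001,Alpha Ltd\n,\n1002,Beta Inc\n1001,Alpha Ltd\n"

NEW_COLUMNS = ["c_col", "d_col", "e_col", "f_col", "g_col", "h_col"]

def clean_frame(df):
    """Apply the article's cleaning steps to a DataFrame and return the result."""
    df = df.reindex(columns=["cnum", "company"] + NEW_COLUMNS)  # new columns start as NaN
    df = df.rename(columns={"company": "company_new"})
    df = df.dropna(axis=0, how="all")                  # drop fully empty rows first ...
    df = df.assign(cnum=df["cnum"].astype("int32"))    # ... so the cast sees no NaN
    return df.drop_duplicates(subset=["cnum", "company_new"], keep="first")

result = clean_frame(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(result.to_csv(index=False))
```

Here `assign` is used instead of assigning to `df['cnum']` directly; this is a stylistic choice in the sketch, not a change to what the script does.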

2. Using VBA:

Option Base 1
Option Explicit

Sub Main()
    On Error GoTo error_handling
    Dim wb As Workbook
    Dim wb_out As Workbook
    Dim sht As Worksheet
    Dim sht_out As Worksheet
    Dim rng As Range
    Dim usedrows As Long                ' Long rather than Byte: a Byte overflows beyond 255 rows
    Dim dict_cnum_company As Object
    Dim str_file_path As String
    Dim str_new_file_path As String

    ' Assign values to variables:
    str_file_path = "c:\users\12078\desktop\python\cnum_company.csv"
    str_new_file_path = "c:\users\12078\desktop\python\cnum_company_output.csv"

    Set wb = CheckAndAttachWorkbook(str_file_path)
    Set sht = wb.Worksheets("cnum_company")
    Set wb_out = Workbooks.Add
    wb_out.SaveAs str_new_file_path, xlCSV   ' Create a CSV file
    Set sht_out = wb_out.Worksheets("cnum_company_output")

    Set dict_cnum_company = CreateObject("Scripting.Dictionary")
    usedrows = WorksheetFunction.Max(GetLastValidRow(sht, "a"), GetLastValidRow(sht, "b"))

    ' Rename the header 'company' to 'company_new'; remove blank and duplicate rows.
    Dim cnum_company As String
    For Each rng In sht.Range("a1", "a" & usedrows)
        If VBA.Trim(rng.Offset(0, 1).Value) = "company" Then
            rng.Offset(0, 1).Value = "company_new"
        End If
        cnum_company = rng.Value & "-" & rng.Offset(0, 1).Value
        If VBA.Trim(cnum_company) <> "-" And Not dict_cnum_company.Exists(cnum_company) Then
            dict_cnum_company.Add cnum_company, ""
        End If
    Next rng

    ' Loop over the dictionary keys and split each key at the first "-" into the cnum and company arrays.
    Dim index_dict As Long
    Dim dict_keys As Variant
    Dim key_parts As Variant
    Dim arr_cnum()
    Dim arr_company()
    dict_keys = dict_cnum_company.Keys
    ReDim arr_cnum(1 To UBound(dict_keys) + 1)
    ReDim arr_company(1 To UBound(dict_keys) + 1)
    For index_dict = 0 To UBound(dict_keys)
        key_parts = Split(dict_keys(index_dict), "-", 2)   ' limit 2 keeps any "-" inside the company name
        arr_cnum(index_dict + 1) = key_parts(0)
        arr_company(index_dict + 1) = key_parts(1)
    Next index_dict

    ' Write the arrays to the output cells.
    sht_out.Range("a1", "a" & UBound(arr_cnum)) = Application.WorksheetFunction.Transpose(arr_cnum)
    sht_out.Range("b1", "b" & UBound(arr_company)) = Application.WorksheetFunction.Transpose(arr_company)

    ' Add six columns to the output CSV file:
    Dim arr_columns() As Variant
    arr_columns = Array("c_col", "d_col", "e_col", "f_col", "g_col", "h_col")
    sht_out.Range("c1:h1") = arr_columns
    Call CheckAndCloseWorkbook(str_file_path, False)
    Call CheckAndCloseWorkbook(str_new_file_path, True)

    Exit Sub
error_handling:
    Call CheckAndCloseWorkbook(str_file_path, False)
    Call CheckAndCloseWorkbook(str_new_file_path, False)
End Sub

' Helper functions:
' Get the last used row of a given column in a worksheet
Function GetLastValidRow(in_ws As Worksheet, in_col As String) As Long
    GetLastValidRow = in_ws.Cells(in_ws.Rows.Count, in_col).End(xlUp).Row
End Function

' Return the workbook at the given path, reusing it if it is already open
Function CheckAndAttachWorkbook(in_wb_path As String) As Workbook
    Dim wb As Workbook
    Dim mywb As String
    mywb = in_wb_path

    For Each wb In Workbooks
        If LCase(wb.FullName) = LCase(mywb) Then
            Set CheckAndAttachWorkbook = wb
            Exit Function
        End If
    Next

    Set wb = Workbooks.Open(in_wb_path, UpdateLinks:=0)
    Set CheckAndAttachWorkbook = wb

End Function

' Close the workbook at the given path if it is open, optionally saving changes
Function CheckAndCloseWorkbook(in_wb_path As String, in_saved As Boolean)
    Dim wb As Workbook
    Dim mywb As String
    mywb = in_wb_path
    For Each wb In Workbooks
        If LCase(wb.FullName) = LCase(mywb) Then
            wb.Close SaveChanges:=in_saved
            Exit Function
        End If
    Next
End Function
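The core of the VBA routine is a dictionary keyed on the combination of cnum and company: a row is added only if its key is not blank and has not been seen before, so the dictionary's keys end up being the de-duplicated rows in first-seen order. As a rough illustration of that idea only, here is a minimal Python sketch using the standard library. The function name `dedupe_rows` is hypothetical, and the sketch uses a tuple as the key instead of the "-"-joined string used in the VBA code, which avoids having to split the key again.

```python
import csv
import io

def dedupe_rows(csv_text):
    """Keep the first occurrence of each (cnum, company) pair, skipping blank rows."""
    seen = {}
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row[:2]]
        cells += [""] * (2 - len(cells))         # pad short or completely empty lines
        cnum, company = cells
        if not cnum and not company:             # blank row: nothing to keep
            continue
        if company == "company":                 # rename the header cell
            company = "company_new"
        seen.setdefault((cnum, company), None)   # dicts preserve insertion order
    return [list(key) for key in seen]

# Hypothetical sample input, invented for illustration only.
sample = "cnum,company\n1001,Alpha Ltd\n,\n\n1002,Beta-Tech Inc\n1001,Alpha Ltd\n"
for cnum, company in dedupe_rows(sample):
    print(cnum, company)
```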

3. Output:

Both methods produce the same output.
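One way to check this kind of claim is to compare the two output files row by row. The sketch below assumes each method's output has been saved under its own file name (in the scripts above both write to the same path, so one of them would need to be changed first) and that both files use the gbk encoding. The function names `load_rows` and `same_csv_content` are hypothetical. Trailing empty cells are ignored, since different tools may or may not write them for the six empty columns.

```python
import csv

def load_rows(path, encoding="gbk"):
    """Read a CSV file into a list of rows, dropping trailing empty cells."""
    rows = []
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle):
            while row and row[-1] == "":
                row.pop()
            if row:                      # skip rows that are entirely empty
                rows.append(row)
    return rows

def same_csv_content(path_a, path_b, encoding="gbk"):
    """Return True if both files contain the same rows in the same order."""
    return load_rows(path_a, encoding) == load_rows(path_b, encoding)
```

A call such as `same_csv_content("pandas_output.csv", "vba_output.csv")` (both file names are placeholders) would then return True when the contents match.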

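4. Comparison and summary:

Python pandas has a large number of built-in data-processing methods, so there is no need to reinvent the wheel; it is convenient to use and the code is much more concise.
Handling the same requirement in Excel VBA needed data structures such as arrays and a dictionary (real-world data volumes are often large, so in some places the code avoids traversing cells one by one), plus many string, array, and dictionary operations. File handling is also complicated, and when something goes wrong, debugging is harder than in Python. The code has been optimized as far as possible, yet it is still much longer than the Python version.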
