python 合并多张excel 处理重复数据
程序员文章站
2022-04-13 23:49:43
...
故事提要:
有个运营的朋友, 6张excel 表需要合并到一张表, 每张表大约65536 个数据, 其中重复的item 需要按一定规则合并。
如果用电脑直接合并excel , 由于数据过多, 系统总是崩溃, 朋友试了很多方法一周也没完美搞定。
正好帮个忙,拿来用python练练手~
比如下面两个重复的数据:
合并后的结果是:
pwk_id :不变, 访问次数: 1+1 = 2 , 操作系统: PC , 总计访问时长:565 + 565 = 1130 , pv: max(2,2) = 2
Do it now:
两小时时间边学边实践写了个简单的python程序, 帮他生成一下合并结果。以后也能用,妈妈再也不用担心我们的数据重复的问题了。
使用经典python 2.7.10 工具库 xlwt + xlrd
下面是所有程序源代码:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import xlwt
import xlrd
import os
# 需要合并的文件目录
dir = '/Users/jesselu/Desktop/excels/'
files = os.listdir(dir)
files = map(lambda path:os.path.join(dir,path),files)
print files
# book1 = xlrd.open_workbook(file,"rd")
def generateResult(files):
def printSheetInfo(book):
print "表单数量", book.nsheets
print "表单名称", book.sheet_names()
sheet0 = book.sheet_by_index(0)
print u"表单%s 共 %d 行 %d 列" % (book.sheet_names()[0], sheet0.nrows, sheet0.ncols)
def mergeRow(old, new):
old[1].value = old[1].value + new[1].value
old[3].value = old[3].value + new[3].value
old[4].value = max(old[4].value, new[4].value)
return old
resultDict = {}
# 合并重复项
for path in files:
book = xlrd.open_workbook(path, "rd")
printSheetInfo(book)
for sheet in book.sheets():
for r in range(sheet.nrows):
row = sheet.row(r)
old_row = resultDict.get(row[0].value, -1)
if old_row < 0:
resultDict[row[0].value] = row
else:
resultDict[row[0].value] = mergeRow(old_row, row)
return resultDict
def write(resultDict):
wb = xlwt.Workbook(encoding='utf-8')
def writeRow(sheet, rowNum, data):
sheet.write(rowNum,0, data[0].value)
sheet.write(rowNum, 1, data[1].value)
sheet.write(rowNum, 2, data[2].value)
sheet.write(rowNum, 3, data[3].value)
sheet.write(rowNum, 4, data[4].value)
def writeHeader(sheet):
sheet.write(0,0,'pwk_id')
sheet.write(0,1,'访问次数')
sheet.write(0, 2, '操作系统')
sheet.write(0, 3, '总计访问时长')
sheet.write(0, 4, 'pv')
sheetId = 0
row_num = 0
resultSheet = wb.add_sheet('mergedSheet', cell_overwrite_ok=True)
writeHeader(resultSheet)
for d, x in resultDict.items():
row_num = row_num + 1
if row_num % 65535 == 0:
sheetId = sheetId + 1
resultSheet = wb.add_sheet(str.format('mergedSheet{}',sheetId), cell_overwrite_ok=True)
writeHeader(resultSheet)
row_num = 0
else:
writeRow(resultSheet,row_num, x)
wb.save('/Users/jesselu/Desktop/excels/result.xls')
if __name__ == "__main__":
write(generateResult(files))
结果:
修改路径之后和处理逻辑后直接可用, 大约1分钟, 合并万6张表, 去掉重复数据4万多个。
感悟:
python 练手项目, “人生苦短,早用python”
参考资料:
python excel 社区官方文档
https://
github.com/python-excel /tutorial/raw/master/python-excel.pdf
https:// github.com/python-excel
上一篇: ps中分辨率怎么调
下一篇: java生成验证码详细步骤解析