Example: Big-Data-Style Analysis of Operating System Logs in Python

程序员文章站 2022-05-30 22:50:04
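This article walks through a Python example that analyzes operating system logs in a big-data (map/reduce) style: a large log file is split into chunks, a mapper counts the dates found in each chunk, and a reducer merges the per-chunk counts. It is shared here for reference; the details are as follows: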

I. Code

1. Splitting the large file

import os

def filesplit(sourcefile, targetfolder):
  # Split sourcefile into chunks of at most `number` lines each,
  # written to targetfolder as <name>1.txt, <name>2.txt, ...
  if not os.path.isfile(sourcefile):
    print(sourcefile, ' does not exist.')
    return
  if not os.path.isdir(targetfolder):
    os.mkdir(targetfolder)
  # Base name without directory or extension, e.g. 'test' for 'test.txt'
  basename = os.path.splitext(os.path.basename(sourcefile))[0]
  tempdata = []
  number = 1000
  filenum = 1
  with open(sourcefile, 'r') as srcfile:
    # Do not strip the line: the trailing newline must be kept,
    # and a blank first line must not end the loop early
    dataline = srcfile.readline()
    while dataline:
      for i in range(number):
        tempdata.append(dataline)
        dataline = srcfile.readline()
        if not dataline:
          break
      desfile = os.path.join(targetfolder, basename + str(filenum) + '.txt')
      # 'w' rather than 'a+', so that rerunning does not duplicate lines
      with open(desfile, 'w') as f:
        f.writelines(tempdata)
      tempdata = []
      filenum = filenum + 1

if __name__ == '__main__':
  #sourcefile = input('input the source file to split:')
  #targetfolder = input('input the target folder you want to place the split files:')
  sourcefile = 'test.txt'
  targetfolder = 'test'
  filesplit(sourcefile, targetfolder)
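
The read-ahead loop above can also be expressed with itertools.islice, which pulls up to a fixed number of lines from the file iterator at a time. The following is only a minimal sketch of that alternative, not the article's own method; the helper name iter_chunks, the in-memory sample, and the chunk size of 3 are all assumptions made for illustration.

```python
import io
from itertools import islice

def iter_chunks(lines, size):
    """Yield lists of at most `size` lines from any iterable of lines."""
    iterator = iter(lines)
    while True:
        # islice stops early when the iterator runs out of lines
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Hypothetical in-memory "file" standing in for a large log
sample = io.StringIO("a\nb\nc\nd\ne\nf\ng\n")
chunks = list(iter_chunks(sample, 3))
print([len(c) for c in chunks])
```

Each chunk could then be written out with writelines, as filesplit does; because the lines are never stripped, their newlines survive the split unchanged.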

2. Mapper code

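import os
import re
import threading

# Note: this function shadows the built-in map() within this script
def map(sourcefile):
  if not os.path.exists(sourcefile):
    print(sourcefile, ' does not exist.')
    return
  # Dates such as 07/10/2013
  pattern = re.compile(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}')
  result = {}
  with open(sourcefile, 'r') as srcfile:
    for dataline in srcfile:
      r = pattern.findall(dataline)
      if r:
        # Count only the first date found on each line
        result[r[0]] = result.get(r[0], 0) + 1
  desfile = sourcefile[0:-4] + '_map.txt'
  # 'w' rather than 'a+', so that rerunning does not duplicate counts
  with open(desfile, 'w') as fp:
    for k, v in result.items():
      fp.write(k + ':' + str(v) + '\n')

if __name__ == '__main__':
  desfolder = 'test'
  # Map only the split chunks; skip output left over from earlier runs
  files = [f for f in os.listdir(desfolder)
           if f.endswith('.txt') and not f.endswith('_map.txt') and f != 'result.txt']
  # Without multithreading, this could simply be written as:
  '''for f in files:
    map(os.path.join(desfolder, f))'''
  # With multithreading: one thread per chunk
  threads = []
  for f in files:
    t = threading.Thread(target=map, args=(os.path.join(desfolder, f),))
    t.start()
    threads.append(t)
  # Wait for every mapper to finish
  for t in threads:
    t.join()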
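
To see what the mapper's regular expression picks out, the counting step can be tried on a few lines held in memory. This is a minimal sketch under stated assumptions: the sample log lines are made up for illustration (they are not from a real log), count_dates is a hypothetical name, and collections.Counter stands in for the dict-and-get idiom used above.

```python
import re
from collections import Counter

# Same pattern as the mapper: dates such as 07/10/2013
DATE_PATTERN = re.compile(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}')

def count_dates(lines):
    """Count the first date found on each line, as the mapper does."""
    counts = Counter()
    for line in lines:
        match = DATE_PATTERN.search(line)  # leftmost match, like findall()[0]
        if match:
            counts[match.group()] += 1
    return counts

# Hypothetical log lines, for illustration only
sample_lines = [
    "07/10/2013 08:00:01 service started\n",
    "07/10/2013 08:00:05 retry scheduled for 07/11/2013\n",
    "no date on this line\n",
    "7/16/2013 09:30:00 service stopped\n",
]
counts = count_dates(sample_lines)
print(counts)
```

Because the pattern allows one or two digits for month and day, a log that mixes 7/16/2013 with 07/16/2013 would produce two separate keys; normalizing the key before counting would be needed in that case.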

3. Reducer code

import os

def reduce(sourcefolder, targetfile):
  if not os.path.isdir(sourcefolder):
    print(sourcefolder, ' does not exist.')
    return
  result = {}
  # Deal only with the mapped files
  allfiles = [os.path.join(sourcefolder, f) for f in os.listdir(sourcefolder) if f.endswith('_map.txt')]
  for f in allfiles:
    with open(f, 'r') as fp:
      for line in fp:
        line = line.strip()
        if not line:
          continue
        # Each line has the form date:count
        position = line.index(':')
        key = line[0:position]
        value = int(line[position + 1:])
        # Add this chunk's count to the running total for the date
        result[key] = result.get(key, 0) + value
  with open(targetfile, 'w') as fp:
    for k, v in result.items():
      fp.write(k + ':' + str(v) + '\n')

if __name__ == '__main__':
  reduce('test', os.path.join('test', 'result.txt'))

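II. Results

Run the three programs above in order to obtain the final result, in which each line is a date followed by the number of log lines counted for it: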

07/10/2013:4634
07/16/2013:51
08/15/2013:3958
07/11/2013:1
10/09/2013:733
12/11/2013:564
02/12/2014:4102
05/14/2014:737
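
The result lines are not in chronological order. If sorted output is wanted, the following is a minimal sketch; it assumes the dates are in month/day/year form (values such as 07/16/2013 suggest this, but the article does not state it), and sort_by_date is a hypothetical helper name.

```python
from datetime import datetime

def sort_by_date(lines):
    """Sort 'date:count' lines chronologically (MM/DD/YYYY is an assumption)."""
    def key(line):
        date_text = line.split(':', 1)[0]
        return datetime.strptime(date_text, '%m/%d/%Y')
    return sorted(lines, key=key)

# The result lines listed above
results = [
    '07/10/2013:4634', '07/16/2013:51', '08/15/2013:3958', '07/11/2013:1',
    '10/09/2013:733', '12/11/2013:564', '02/12/2014:4102', '05/14/2014:737',
]
for line in sort_by_date(results):
    print(line)
```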

We hope this article is helpful for your Python programming.