Read a large file with Python

程序员文章站 2022-07-02 16:35:43


Reading large files in Python

The more Pythonic way: use a with block

The file is closed automatically
Exceptions can be handled inside the with block

    with open(filename, 'rb') as f:
        for line in f:
            do_something(line)  # placeholder: process one line at a time

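The biggest advantage: iterating over the file object with for line in f automatically uses buffered I/O and keeps only the current line in memory, so the size of the file is not a concern. The one exception is a file with extremely long lines (or no newlines at all), because each line is still loaded whole.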
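
To make the memory behaviour concrete, here is a minimal sketch that walks a file line by line and records only two numbers. The function name line_stats is hypothetical and not from the original article; the point is that memory use is bounded by the longest line, not by the file size.

```python
def line_stats(path):
    """Return (line_count, longest_line_in_bytes) for the file at `path`."""
    count = 0
    longest = 0
    # Binary mode: each `line` is a bytes object that still ends with b'\n'
    # (except possibly the last line of the file).
    with open(path, 'rb') as f:
        for line in f:  # buffered, one line at a time
            count += 1
            longest = max(longest, len(line))
    return count, longest
```

Calling line_stats('bigfile') returns a tuple; a very large second value is a hint that line iteration alone will not keep memory low for that file.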

There should be one-- and preferably only one --obvious way to do it. (The Zen of Python)

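Using a generator

If you want finer-grained control over how much is read on each iteration, you can read the file through a generator that yields fixed-size chunks.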

    def readinchunks(file_obj, chunksize=2048):
        """
        Lazy function to read a file piece by piece.
        Default chunk size: 2 KB.
        """
        while True:
            data = file_obj.read(chunksize)
            if not data:  # an empty result means end of file
                break
            yield data

    # Binary mode, so chunksize is counted in bytes; the with block closes the file.
    with open('bigfile', 'rb') as f:
        for chunk in readinchunks(f):
            do_something(chunk)
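
As one possible use of chunked reading, the sketch below computes a SHA-256 checksum of a file without loading it whole. The names iter_chunks and sha256_of_file and the 64 KB default are illustrative assumptions, not part of the original article; hashlib is from the standard library.

```python
import hashlib


def iter_chunks(file_obj, chunksize=64 * 1024):
    """Yield successive chunks from a file object until end of file."""
    while True:
        data = file_obj.read(chunksize)
        if not data:
            break
        yield data


def sha256_of_file(path, chunksize=64 * 1024):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter_chunks(f, chunksize):
            digest.update(chunk)  # the hash object accumulates state across chunks
    return digest.hexdigest()
```

Chunked reading suits this kind of task because the result does not depend on where the chunk boundaries fall.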

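On Linux, use the split command (it splits a file into several smaller files by size or by line count; the last piece may be smaller than the rest)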

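    wc -l blm.txt  # count the lines in blm.txt
    # Split the file with split
    split -l 2482 ../blm/blm.txt -d -a 4 blm_
    # Splits blm.txt into smaller files of 2482 lines each (-l 2482), with the
    # prefix blm_, numeric rather than alphabetic suffixes (-d), and
    # four-digit suffixes (-a 4)


    # Split by line count
    split -l 300 large_file.txt new_file_prefix
    # Split by file size
    split -b 10m server.log waynelog

    # Merge the pieces with redirection: '>' overwrites the target file, '>>' appends to it
    cat file_prefix* > large_file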
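
Where the split command is not available, roughly the same result as split -l N -d -a 4 can be sketched in Python. This is an approximation under stated assumptions, not a reimplementation of split: the function name split_by_lines and its parameters are hypothetical, and only the line-count mode is covered.

```python
def split_by_lines(path, lines_per_file, prefix, suffix_width=4):
    """Write consecutive groups of lines to prefix0000, prefix0001, ...

    Returns the list of output file names in the order they were written.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    outputs = []
    out = None
    try:
        with open(path, 'rb') as src:
            for index, line in enumerate(src):
                if index % lines_per_file == 0:
                    # Start a new piece every lines_per_file lines.
                    if out is not None:
                        out.close()
                    name = f"{prefix}{len(outputs):0{suffix_width}d}"
                    out = open(name, 'wb')
                    outputs.append(name)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

For example, split_by_lines('blm.txt', 2482, 'blm_') would produce blm_0000, blm_0001, and so on; concatenating the pieces in order restores the original file.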

In day-to-day work, user data, cached logs, and similar files are often large.

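Supplement: the linecache module

When you read a line through linecache, the module first tries to serve it from an in-memory cache of the file's lines, which speeds up repeated reads and reduces I/O. Note that linecache loads the whole file into memory on first access, so it suits repeated random access to lines rather than files too large to fit in RAM.

linecache.getline(filename, lineno): returns line number lineno of the file; note that the returned string includes the trailing newline
linecache.clearcache(): clears the current file cache
linecache.checkcache(filename=None): checks whether the cache is still valid, since the file on disk may have changed; with no argument, it checks all entries in the cache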

    import linecache
    linecache.getline(linecache.__file__, 8)
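
The three functions can be combined into a small helper. This is a minimal sketch: the name nth_line is hypothetical, and calling checkcache before every lookup is a design assumption that trades a stat call for freshness.

```python
import linecache


def nth_line(path, lineno):
    """Return line `lineno` (1-based) without its trailing newline.

    Returns None if the file or the line does not exist.
    """
    # Drop the cached copy if the file's size or modification time changed.
    linecache.checkcache(path)
    line = linecache.getline(path, lineno)
    if line == '':  # getline returns '' on any error; an empty line is '\n'
        return None
    return line.rstrip('\n')
```

Because getline never raises, the empty-string check is the only way to tell a missing line from a blank one.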

Exercise:
Given a 400 MB file (generated from /etc/passwd), count how many times the string root occurs in it

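    import time

    total = 0  # named total to avoid shadowing the built-in sum()
    start = time.time()
    with open('file', 'r') as f:
        for line in f:
            total += line.count('root')  # occurrences of 'root' in this line
    end = time.time()
    print(total, end - start)  # count and elapsed seconds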
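
If the file had very long lines or no newlines, line iteration would no longer bound memory use. A chunk-based variant is sketched below under two assumptions: the name count_occurrences is hypothetical, and the search string must not be able to overlap with itself (true for b'root'). The last len(needle) - 1 bytes of each buffer are carried over so that a match straddling two chunks is still counted exactly once.

```python
def count_occurrences(path, needle, chunksize=1024 * 1024):
    """Count occurrences of the bytes `needle` in a file, reading fixed-size chunks.

    Assumes `needle` cannot overlap with itself (true for b'root').
    """
    if not needle:
        raise ValueError("needle must not be empty")
    keep = len(needle) - 1  # bytes carried over to catch boundary matches
    tail = b''
    total = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf = tail + chunk
            total += buf.count(needle)
            # The tail is shorter than the needle, so it never holds a
            # complete match and nothing is counted twice.
            tail = buf[-keep:] if keep else b''
    return total
```

For the exercise above, count_occurrences('file', b'root') should give the same total as the line-based loop.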

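Note: the author reports that this program sometimes runs about 10 times faster than C or shell versions, and attributes this to Python's buffered reads and to data already being in cache, which reduces I/O. Such results depend heavily on how the C or shell version is written and on whether the file is already in the operating system's page cache.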

References: How to read a large file
