
Big Data: What Does Hive Do When Loading External Data?


Table of Contents
  1. Introduction
  2. DEBUG Log Details
  3. Conclusion


1. Introduction

Below are some key entries captured from the logs with Hive running in DEBUG mode (recorded in chronological order, with a few inline comments added). If you would rather skip the messy log output, here is the result up front: when Hive loads external data, it first reads the external data, then copies it into the local hive/warehouse directory, and finally deletes the external source file (a rather surprising move; why does Hive do this?).
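For context, here is a minimal sketch of the kind of statement that produces the log below, submitted through Hive's JDBC interface. The table name `text_gzip5` and the `s3a://` source path come from the log itself; the class name and the HiveServer2 connection URL are placeholders made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadFromS3 {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; adjust host, port and database to your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Without the LOCAL keyword, LOAD DATA ... INPATH is defined to *move*
            // the source file into the table's location rather than leave it behind.
            stmt.execute(
                "LOAD DATA INPATH 's3a://BucketKun/bbb/000000.gz' INTO TABLE text_gzip5");
        }
    }
}
```

Keep that move semantics in mind while reading the log: it explains the delete request that shows up at the end.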


2. DEBUG Log Details

By default, the work below is done by the `main` thread:

s3a.S3AFileSystem: op_glob_status += 1  ->  2
s3a.S3AFileSystem: op_get_file_status += 1  ->  2
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)	//get path status
s3a.S3AFileSystem: object_metadata_requests += 1  ->  2
s3a.S3AFileSystem: Found exact file: normal file	//found the exact file (a normal file)
s3a.S3AFileSystem: List status for path: s3a://BucketKun/bbb/000000.gz	//List path
s3a.S3AFileSystem: op_list_status += 1  ->  1
s3a.S3AFileSystem: op_get_file_status += 1  ->  3
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)	//get path status
s3a.S3AFileSystem: object_metadata_requests += 1  ->  3
s3a.S3AFileSystem: Found exact file: normal file	//found the exact file (a normal file)
s3a.S3AFileSystem: Adding: rd (not a dir): s3a://BucketKun/bbb/000000.gz	//the file itself is added to the listStatus result (it is not a directory)
s3a.S3AFileSystem: op_get_file_status += 1  ->  4
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)	//get path status
s3a.S3AFileSystem: object_metadata_requests += 1  ->  4
s3a.S3AFileSystem: Found exact file: normal file	//found the exact file (a normal file)
s3a.S3AFileSystem: Opening 's3a://BucketKun/bbb/000000.gz' for reading.	//open the file for reading
s3a.S3AFileSystem: op_get_file_status += 1  ->  5  
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)	//get path status
s3a.S3AFileSystem: object_metadata_requests += 1  ->  5
s3a.S3AFileSystem: Found exact file: normal file	//found the exact file (a normal file)
s3a.S3AInputStream: reopen(s3a://BucketKun/bbb/000000.gz) for read from new offset range[0-386], length=4, streamPosition=0, nextReadPosition=0, policy=normal	//reopen for read
s3a.S3AInputStream: Closing stream close() operation: soft	//close
s3a.S3AInputStream: Drained stream of 382 bytes
s3a.S3AInputStream: Stream s3a://BucketKun/bbb/000000.gz closed: close() operation; remaining=382 streamPos=4, nextReadPos=4, request range 0-386 length=386
s3a.S3AInputStream: Statistics of stream bbb/000000.gz	//statistics for the bbb/000000.gz stream
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, BytesRead=4, BytesRead excluding skipped=4, ReadOperations=1, ReadFullyOperations=0, ReadsIncomplete=0, BytesReadInClose=382, BytesDiscardedInAbort=0, InputPolicy=0, InputPolicySetCount=1}


FileOperations: moving s3a://BucketKun/bbb/000000.gz to file:/user/hive/warehouse/text_gzip5 (replace = KEEP_EXISTING)
s3a.S3AFileSystem: op_glob_status += 1  ->  3
s3a.S3AFileSystem: op_get_file_status += 1  ->  6
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)
s3a.S3AFileSystem: object_metadata_requests += 1  ->  6
s3a.S3AFileSystem: Found exact file: normal file
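The counters in the block above (`op_glob_status`, `op_get_file_status`, `op_list_status`) map onto ordinary Hadoop `FileSystem` calls that Hive makes while inspecting the source before moving it. Below is a rough sketch of that client-side sequence using the public `FileSystem` API (it assumes `hadoop-aws` and S3 credentials are configured; the exact purpose of Hive's 4-byte read is not visible in the log, so the small header read here is only illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectSource {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("s3a://BucketKun/bbb/000000.gz");
        FileSystem fs = src.getFileSystem(conf);      // resolves to S3AFileSystem

        FileStatus[] matched = fs.globStatus(src);    // "op_glob_status += 1"
        FileStatus status = fs.getFileStatus(src);    // "op_get_file_status += 1" (HEAD request)
        FileStatus[] listed = fs.listStatus(src);     // "List status for path: ..."

        // The first stream in the log reads only 4 bytes, then closes and drains
        // the remaining 382 bytes of the requested range (see the StreamStatistics).
        byte[] header = new byte[4];
        try (FSDataInputStream in = fs.open(src)) {   // "Opening '...' for reading."
            in.readFully(0, header);
        }
        System.out.printf("glob=%d listed=%d size=%d%n",
                matched.length, listed.length, status.getLen());
    }
}
```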

Then the `move-thread-0` thread takes over:

s3a.S3AFileSystem: op_get_file_status += 1  ->  7
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)	//get path status
s3a.S3AFileSystem: object_metadata_requests += 1  ->  7
s3a.S3AFileSystem: Found exact file: normal file	//found the exact file (a normal file)
s3a.S3AFileSystem: Opening 's3a://BucketKun/bbb/000000.gz' for reading.
s3a.S3AFileSystem: op_get_file_status += 1  ->  8
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)
s3a.S3AFileSystem: object_metadata_requests += 1  ->  8
s3a.S3AFileSystem: Found exact file: normal file
s3a.S3AInputStream: reopen(s3a://BucketKun/bbb/000000.gz) for read from new offset range[0-386], length=4096, streamPosition=0, nextReadPosition=0, policy=normal
s3a.S3AInputStream: Closing stream close() operation: soft
s3a.S3AInputStream: Drained stream of 0 bytes
s3a.S3AInputStream: Stream s3a://BucketKun/bbb/000000.gz closed: close() operation; remaining=0 streamPos=386, nextReadPos=386, request range 0-386 length=386
s3a.S3AInputStream: Statistics of stream bbb/000000.gz
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, BytesRead=386, BytesRead excluding skipped=386, ReadOperations=1, ReadFullyOperations=0, ReadsIncomplete=1, BytesReadInClose=0, BytesDiscardedInAbort=0, InputPolicy=0, InputPolicySetCount=1}
s3a.S3AFileSystem: op_get_file_status += 1  ->  9
s3a.S3AFileSystem: Getting path status for s3a://BucketKun/bbb/000000.gz  (bbb/000000.gz)
s3a.S3AFileSystem: object_metadata_requests += 1  ->  9
s3a.S3AFileSystem: Found exact file: normal file
s3a.S3AFileSystem: Delete path s3a://BucketKun/bbb/000000.gz - recursive true
s3a.S3AFileSystem: delete: Path is a file
s3a.S3AFileSystem: object_delete_requests += 1  ->  1
s3a.S3AFileSystem: object_metadata_requests += 1  ->  10
s3a.S3AFileSystem: object_metadata_requests += 1  ->  11
s3a.S3AFileSystem: Found file (with /): fake directory

Finally, back on the `main` thread:

metadata.Hive: Moved src: s3a://BucketKun/bbb/000000.gz, to dest: file:/user/hive/warehouse/text_gzip5/000000_copy_1.gz

3. Conclusion

When Hive loads external data, it first reads the external data, then copies it into the local hive/warehouse directory, and finally deletes the external source file (a rather surprising move; why does Hive do this?).
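My reading of the log (the log itself does not say why): the source sits on `s3a://` and the destination on `file:/`, which are two different `FileSystem` implementations, and `rename()` cannot cross filesystems. A "move" between them therefore has to be implemented as copy-then-delete, which is exactly what `move-thread-0` does. Here is a minimal sketch of such a cross-filesystem move using Hadoop's public `FileUtil` helper (Hive's internal code path may differ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CrossFsMove {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("s3a://BucketKun/bbb/000000.gz");
        Path dst = new Path("file:/user/hive/warehouse/text_gzip5/000000_copy_1.gz");

        FileSystem srcFs = src.getFileSystem(conf);   // S3AFileSystem
        FileSystem dstFs = dst.getFileSystem(conf);   // LocalFileSystem

        // rename() only works within a single FileSystem, so a move between
        // s3a:// and file:/ degrades to copy + delete. With deleteSource=true,
        // the source object is removed once the copy succeeds, which matches the
        // "Delete path ... recursive true" entry in the move-thread-0 log above.
        boolean moved = FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ true, conf);
        System.out.println("moved: " + moved);
    }
}
```

If the source file needs to stay where it is, a common alternative is to define an external table whose LOCATION points at the S3 directory instead of running LOAD DATA against it.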

Reposted from: https://juejin.im/post/5c3b114ce51d452ec6217d11