Sample code for reading and writing Parquet-format data in Java
This article shows how to read and write Parquet-format data in Java. The details are as follows:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ReadParquet {

    static Logger logger = Logger.getLogger(ReadParquet.class);

    public static void main(String[] args) throws Exception {
        // parquetWriter("test\\parquet-out2", "input.txt");
        parquetReaderV2("test\\parquet-out2");
    }

    static void parquetReaderV2(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        Builder<Group> reader = ParquetReader.builder(readSupport, new Path(inPath));
        ParquetReader<Group> build = reader.build();
        Group line = null;
        while ((line = build.read()) != null) {
            Group time = line.getGroup("time", 0);
            // Fields can be accessed either by index or by field name:
            /*System.out.println(line.getString(0, 0) + "\t" +
                    line.getString(1, 0) + "\t" +
                    time.getInteger(0, 0) + "\t" +
                    time.getString(1, 0) + "\t");*/
            System.out.println(line.getString("city", 0) + "\t" +
                    line.getString("ip", 0) + "\t" +
                    time.getInteger("ttl", 0) + "\t" +
                    time.getString("ttl2", 0) + "\t");
            //System.out.println(line.toString());
        }
        System.out.println("read end");
    }

    // In newer versions all of the new ParquetReader(...) constructors appear to be deprecated;
    // build the reader with the builder above instead.
    static void parquetReader(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath), readSupport);
        Group line = null;
        while ((line = reader.read()) != null) {
            System.out.println(line.toString());
        }
        System.out.println("read end");
    }

    /**
     * @param outPath output path for the Parquet file
     * @param inPath  input plain-text file
     * @throws IOException
     */
    static void parquetWriter(String outPath, String inPath) throws IOException {
        MessageType schema = MessageTypeParser.parseMessageType("message pair {\n" +
                " required binary city (UTF8);\n" +
                " required binary ip (UTF8);\n" +
                " repeated group time {\n" +
                "   required int32 ttl;\n" +
                "   required binary ttl2;\n" +
                " }\n" +
                "}");
        GroupFactory factory = new SimpleGroupFactory(schema);
        Path path = new Path(outPath);
        Configuration configuration = new Configuration();
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        writeSupport.setSchema(schema, configuration);
        ParquetWriter<Group> writer = new ParquetWriter<Group>(path, configuration, writeSupport);
        // Read a local text file and use it to generate the Parquet file
        BufferedReader br = new BufferedReader(new FileReader(new File(inPath)));
        String line = "";
        Random r = new Random();
        while ((line = br.readLine()) != null) {
            String[] strs = line.split("\\s+");
            if (strs.length == 2) {
                Group group = factory.newGroup()
                        .append("city", strs[0])
                        .append("ip", strs[1]);
                Group tmpG = group.addGroup("time");
                tmpG.append("ttl", r.nextInt(9) + 1);
                tmpG.append("ttl2", r.nextInt(9) + "_a");
                writer.write(group);
            }
        }
        System.out.println("write end");
        writer.close();
    }
}
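For reference (not part of the original post), parquetWriter expects each input line to hold two whitespace-separated fields, city and ip. A hypothetical input.txt and the corresponding parquetReaderV2 output might look like this (ttl/ttl2 are random, and the output columns are tab-separated):

input.txt:
beijing  1.2.3.4
shanghai 5.6.7.8

parquetReaderV2 output:
beijing    1.2.3.4    3    5_a
shanghai   5.6.7.8    7    2_a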
A few words about the schema (writing Parquet data requires a schema; when reading, the schema is picked up "automatically" from the file):
Each field has three attributes: a repetition level, a data type and a field name. The repetition level is one of:
- required (appears exactly once)
- repeated (appears zero or more times)
- optional (appears zero or one time)
The data type of a field falls into two kinds: group (complex type) and primitive (basic type). The primitive types are: int64, int32, boolean, binary, float, double, int96, fixed_len_byte_array.
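As a small illustration (the field names here are made up and not from the original post), a schema mixing all three repetition levels could be written as:

message example {
  required binary city (UTF8);   // exactly one value per record
  optional int32 age;            // zero or one value per record
  repeated group time {          // zero or more occurrences per record
    required int32 ttl;
    required binary ttl2 (UTF8);
  }
}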
repeated and required differ not only in how many times a field may appear; the serialized data also comes out differently. For example, ttl2 declared under repeated prints as WrappedArray([7,7_a]), while declared under required it prints as [7,7_a]. Besides generating a MessageType with MessageTypeParser.parseMessageType, you can also build one with the method below.
(Note a pitfall here, which shows up in Spark: as(OriginalType.UTF8) on ttl2 below plays the same role as the (UTF8) in required binary city (UTF8). With the UTF8 annotation the column can be read back as StringType; without it, reading fails with [B cannot be cast to java.lang.String.)
/*MessageType schema = MessageTypeParser.parseMessageType("message pair {\n" +
        " required binary city (UTF8);\n" +
        " required binary ip (UTF8);\n" +
        " repeated group time {\n" +
        "   required int32 ttl;\n" +
        "   required binary ttl2;\n" +
        " }\n" +
        "}");*/
//import org.apache.parquet.schema.Types;
//import org.apache.parquet.schema.OriginalType;
//import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("city")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ip")
        .repeatedGroup()
            .required(PrimitiveTypeName.INT32).named("ttl")
            .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ttl2")
            .named("time")
        .named("pair");
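For context on the Spark pitfall mentioned above, here is a minimal sketch of reading the generated file with Spark SQL (Spark is not part of the original code; the class name, master and path are illustrative, and spark-sql is assumed to be on the classpath):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkReadParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("read-parquet")
                .getOrCreate();
        // With the UTF8 annotation, city/ip/ttl2 load as StringType; without it they stay
        // plain binary and treating them as strings fails with
        // "[B cannot be cast to java.lang.String".
        Dataset<Row> df = spark.read().parquet("test/parquet-out2");
        df.printSchema();
        df.show(false);
        spark.stop();
    }
}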
Fixing the [B cannot be cast to java.lang.String exception:
1. Either add the UTF8 annotation when generating the Parquet file, or
2. Supply the same schema again at read time so the field's type is declared explicitly, for example:
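The snippet the original post refers to here is missing; below is a minimal sketch of the idea, assuming the same parquet-hadoop example classes as above. GroupReadSupport picks the requested read schema up from the ReadSupport.PARQUET_READ_SCHEMA configuration key:

//import org.apache.parquet.hadoop.api.ReadSupport;
Configuration conf = new Configuration();
// Same schema as the one used for writing, but with the UTF8 annotation on ttl2 as well
MessageType readSchema = MessageTypeParser.parseMessageType("message pair {\n" +
        " required binary city (UTF8);\n" +
        " required binary ip (UTF8);\n" +
        " repeated group time {\n" +
        "   required int32 ttl;\n" +
        "   required binary ttl2 (UTF8);\n" +
        " }\n" +
        "}");
// Hand the requested schema to GroupReadSupport through the configuration
conf.set(ReadSupport.PARQUET_READ_SCHEMA, readSchema.toString());

ParquetReader<Group> reader = ParquetReader
        .builder(new GroupReadSupport(), new Path("test\\parquet-out2"))
        .withConf(conf)
        .build();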
Maven dependency (I'm using 1.7):
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.7.0</version>
</dependency>
That's all for this article. I hope it helps with your learning.