Hadoop Chinese encoding issues -- processing GBK-encoded data in a MapReduce program and writing GBK-encoded output

程序员文章站 (Programmer Article Station), 2022-06-11 18:09:38

Sample code for a job whose input is a GBK file and whose output is also a GBK file:

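When processing GBK text with Hadoop, the output came out garbled. It turns out that Hadoop hard-codes UTF-8 wherever a text encoding is involved, so a file in any other encoding (such as GBK) ends up garbled.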

To fix this, when the Mapper or Reducer reads a Text value, call transformTextToUTF8(text, "GBK") to transcode it, so that everything downstream runs on UTF-8.

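import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Decode the raw bytes held by a Text with the given encoding and re-wrap the result as a (UTF-8) Text.
public static Text transformTextToUTF8(Text text, String encoding) {
    try {
        // Only the first getLength() bytes of the backing array are valid.
        String value = new String(text.getBytes(), 0, text.getLength(), encoding);
        return new Text(value);
    } catch (UnsupportedEncodingException e) {
        // Fail loudly instead of passing a null String on to new Text(...).
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
    }
}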

The core of this is: String line = new String(value.getBytes(), 0, value.getLength(), "GBK"); // value is of type Text

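Using String line = value.toString(); directly produces garbled text, and the cause is the Text Writable type itself. When I was starting out, I assumed Text wraps String in the same way LongWritable wraps long. In fact Text and String differ: Text is a Writable that stores its contents as UTF-8 bytes, whereas a Java String is a sequence of Unicode characters. value.toString() therefore assumes the bytes it holds are UTF-8 encoded, so GBK-encoded data that was read into a Text turns into garbage when that method is called on it.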
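The effect described above can be reproduced without Hadoop. The following is a minimal sketch using only the JDK: the class name GbkDecodeDemo and its helper methods are made up for illustration, and a plain byte array stands in for the bytes a Text would hold after reading a GBK line. It assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it uses no Hadoop types.
final class GbkDecodeDemo {
    // GBK is looked up by name; common JDK builds include it, but it is not a guaranteed charset.
    static final Charset GBK = Charset.forName("GBK");

    // Roughly what Text.toString() does: interpret the bytes as UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // What the article recommends: interpret the same bytes as GBK.
    static String decodeAsGbk(byte[] raw) {
        return new String(raw, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";     // the two characters of "Chinese text", written as escapes
        byte[] raw = original.getBytes(GBK);  // stands in for a GBK line read from the input file

        System.out.println(raw.length);                         // 4: two bytes per character in GBK
        System.out.println(original.equals(decodeAsUtf8(raw))); // false: the text is garbled
        System.out.println(original.equals(decodeAsGbk(raw)));  // true
    }
}
```

The bytes are identical in both calls; only the charset used to interpret them changes, which is why the fix belongs at the point where the Text is turned into a String.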

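The correct approach is to take the byte array from the input Text value (value.getBytes()) and use the String constructor String(byte[] bytes, int offset, int length, Charset charset), or its overload that takes a charset name as in the code above, to build a new String by decoding the specified sub-array of bytes with the given charset.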
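The offset and length arguments matter because the array returned by Text.getBytes() is only valid up to getLength(); anything beyond that may be left over from an earlier, longer record. The sketch below imitates that situation with a plain byte array. The class ValidLengthDemo and the way the buffer is filled are assumptions made for illustration, not Hadoop code.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class; a plain byte[] stands in for the backing array of a Text.
final class ValidLengthDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Decode only the first `length` bytes, as in
    // new String(value.getBytes(), 0, value.getLength(), "GBK").
    static String decodeValid(byte[] buffer, int length) {
        return new String(buffer, 0, length, GBK);
    }

    public static void main(String[] args) {
        byte[] longLine = "\u4e2d\u6587\u7f16\u7801".getBytes(GBK); // 4 characters, 8 bytes
        byte[] shortLine = "\u4e2d\u6587".getBytes(GBK);            // 2 characters, 4 bytes

        // Imitate a reused buffer: the short record overwrites the start of the long one.
        byte[] buffer = Arrays.copyOf(longLine, longLine.length);
        System.arraycopy(shortLine, 0, buffer, 0, shortLine.length);
        int validLength = shortLine.length;

        System.out.println(decodeValid(buffer, validLength).length()); // 2: only the current record
        System.out.println(new String(buffer, GBK).length());          // 4: stale bytes leak in
    }
}
```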

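If a map/reduce job needs to write its output in some other encoding, you have to implement your own OutputFormat and specify the encoding there; the default TextOutputFormat cannot be used for this.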

A concrete example can be found in a post on the official blog of Taobao's Data Platform and Products Department.

Source: "How Hadoop map/reduce jobs handle input data that is not UTF-8 encoded"

The following is excerpted from the official blog of Taobao's Data Platform and Products Department:

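1 The Chinese-text problem
Chinese text is parsed out of a URL, yet it still prints as garbage in Hadoop? We used to think Hadoop did not support Chinese. After reading the source code, we found that Hadoop merely does not support writing Chinese in GBK.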

Below is code from TextOutputFormat.class. Hadoop's default outputs all inherit from FileOutputFormat, which has two subclasses: one for binary-stream output, and TextOutputFormat for text output.

    public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
      protected static class LineRecordWriter<K, V>
          implements RecordWriter<K, V> {
        private static final String utf8 = "UTF-8"; // hard-coded to UTF-8 here
        private static final byte[] newline;
        static {
          try {
            newline = "\n".getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
        …
        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
          this.out = out;
          try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
        …
        private void writeObject(Object o) throws IOException {
          if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength()); // this also needs to change
          } else {
            out.write(o.toString().getBytes(utf8));
          }
        }
        …
      }
    }
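As the code shows, Hadoop's default output is hard-coded to UTF-8. So if the Chinese text was decoded correctly, setting the Linux client's character set to UTF-8 is enough to see it, because Hadoop wrote the Chinese as UTF-8.
Most databases, however, define their fields in GBK. What if we want Hadoop to write Chinese in GBK so that it is compatible with the database?
We can define a new class: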
    public class GBKOutputFormat<K, V> extends FileOutputFormat<K, V> {
      protected static class LineRecordWriter<K, V>
          implements RecordWriter<K, V> {
        // just change the encoding to GBK
        private static final String gbk = "GBK";
        private static final byte[] newline;
        static {
          try {
            newline = "\n".getBytes(gbk);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + gbk + " encoding");
          }
        }
        …
        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
          this.out = out;
          try {
            this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + gbk + " encoding");
          }
        }
        …
        private void writeObject(Object o) throws IOException {
          // The Text special case is removed: copying a Text's raw bytes would emit UTF-8.
          // Every object, Text included, goes through toString() and is encoded as GBK.
          out.write(o.toString().getBytes(gbk));
        }
        …
      }
    }
Then add conf1.setOutputFormat(GBKOutputFormat.class) to the MapReduce code, and the job will write Chinese in GBK.
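To check the encoding step in isolation, the writer logic can be exercised without a Hadoop cluster. The sketch below is a JDK-only stand-in: GbkLineWriter and encodeRecord are hypothetical names, the class does not implement Hadoop's RecordWriter, and it skips the null-key and null-value handling that a real output format would need. It only shows that key, separator, value and newline all end up as GBK bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical stand-in for a GBK line writer; not part of Hadoop.
final class GbkLineWriter {
    static final Charset GBK = Charset.forName("GBK");
    private static final byte[] NEWLINE = "\n".getBytes(GBK);

    private final DataOutputStream out;
    private final byte[] keyValueSeparator;

    GbkLineWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        this.keyValueSeparator = keyValueSeparator.getBytes(GBK);
    }

    // Every object goes through toString() and is encoded as GBK.
    private void writeObject(Object o) throws IOException {
        out.write(o.toString().getBytes(GBK));
    }

    // Writes one "key<separator>value\n" record.
    void write(Object key, Object value) throws IOException {
        writeObject(key);
        out.write(keyValueSeparator);
        writeObject(value);
        out.write(NEWLINE);
    }

    // Demo helper: encode a single tab-separated record into an in-memory buffer.
    static byte[] encodeRecord(Object key, Object value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            new GbkLineWriter(out, "\t").write(key, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeRecord("\u4e2d\u6587", 1);
        System.out.println(bytes.length); // 7: 4 (key) + 1 (tab) + 1 ("1") + 1 (newline)
    }
}
```

Decoding the resulting bytes as GBK gives back the original record, while a UTF-8 reader would see garbage, which is the mirror image of the input-side problem.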
