Encoding Forms of the Unicode Character Set (UTF-8, UTF-16, UTF-32)

程序员文章站 2022-05-10 19:59:44

Unicode contains 1,112,064 valid code points. The Unicode standard defines several encoding forms for them, including UTF-8, UTF-16, and UTF-32, where UTF stands for Unicode Transformation Format.
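
The figure of 1,112,064 can be checked with simple arithmetic: it is the size of the code space (U+0000 to U+10FFFF) minus the 2,048 surrogate code points (U+D800 to U+DFFF) that are reserved for UTF-16. A minimal sketch, with constant names chosen here purely for illustration:

```python
# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1           # 1,114,112

# U+D800..U+DFFF are reserved for UTF-16 surrogates and never represent characters.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1      # 2,048

VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATE_COUNT
print(VALID_CODE_POINTS)  # 1112064
```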

UTF-8

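UTF-8 is used by more than 90% of websites. The first 128 Unicode code points are the ASCII characters, and UTF-8 encodes each of them as the same single byte, which means any ASCII text is also valid UTF-8 text. The encoding rules are as follows: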

Code point range (hex)	Max value (payload bits)	UTF-8 byte sequence
0000 0000 ~ 0000 007F	127 (7 bits)	0xxxxxxx
0000 0080 ~ 0000 07FF	2047 (11 bits)	110xxxxx 10xxxxxx
0000 0800 ~ 0000 FFFF	65535 (16 bits)	1110xxxx 10xxxxxx 10xxxxxx
0001 0000 ~ 0010 FFFF	1114111 (21 bits)	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

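The code point of "林" is U+6797, usually written as "\u6797". In binary it is 0110 011110 010111 (split into 4 + 6 + 6 bits to fill the x positions of the three-byte row), so its UTF-8 encoding is 11100110 10011110 10010111, which is E6 9E 97 in hexadecimal.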
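
The table can be turned into code fairly directly. The following is a hand-written sketch for illustration only (the function name `utf8_encode` is an assumption of this example, and real programs should simply call `str.encode("utf-8")`); it reproduces the E6 9E 97 result for U+6797:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 by hand, row by row."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x6797)
print(encoded.hex().upper())                 # E69E97
print(encoded == "\u6797".encode("utf-8"))   # True
```

Comparing against the built-in codec, as in the last line, is a cheap way to sanity-check a manual encoder like this one.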

UCS-2

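UCS-2 (2-byte Universal Coded Character Set) is a fixed-length 16-bit encoding. It can only encode the first 65,536 Unicode code points, known as the BMP (Basic Multilingual Plane). UCS-2 is obsolete, although a lot of software still uses it. UTF-16 extends UCS-2: for code points in the BMP, the UCS-2 and UTF-16 encodings are identical.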

UTF-16

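UTF-16 (16-bit Unicode Transformation Format) is a variable-length encoding. Windows, Java, and JavaScript use UTF-16 internally, but it is rarely used for files on Unix, Linux, and macOS. For security reasons, the WHATWG (Web Hypertext Application Technology Working Group) recommends against using UTF-16 on the web, where it accounts for only about 0.01% of pages.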

U+0000 ~ U+D7FF and U+E000 ~ U+FFFF

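Within this range, UTF-16 and UCS-2 code units correspond directly to code points, and both use a single 16-bit value. As of Unicode 9.0, some modern non-Latin Asian, Middle-Eastern, and African scripts, as well as most emoji, fall outside this range.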
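
To see which characters fit in a single 16-bit code unit, one can count code units with Python's built-in codec. This is only a small sketch; the helper name `utf16_code_units` is made up for this example:

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for `text`."""
    # "utf-16-le" writes no byte order mark, and every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units("A"))            # 1  (U+0041, in the BMP)
print(utf16_code_units("\u6797"))       # 1  (U+6797, in the BMP)
print(utf16_code_units("\U0001F600"))   # 2  (U+1F600, an emoji outside the BMP)
```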

U+10000 ~ U+10FFFF
Code points in the supplementary planes are encoded as two 16-bit code units, called a "surrogate pair". The rules are as follows:

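/*
(1) Subtract 0x010000 from the code point, leaving a 20-bit value in the range 0x000000 ~ 0x0FFFFF
(2) Take the high 10 bits (0x0000 ~ 0x03FF) and add 0xD800, giving the first 16-bit
    code unit (range 0xD800 ~ 0xDBFF), called the "high (leading) surrogate"
(3) Take the low 10 bits (0x0000 ~ 0x03FF) and add 0xDC00, giving the second 16-bit
    code unit (range 0xDC00 ~ 0xDFFF), called the "low (trailing) surrogate".
    Note that a low surrogate code unit is always numerically greater than a high surrogate
*/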
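
The three rules above can be sketched in a few lines of Python, together with the inverse mapping. The function names `to_surrogate_pair` and `from_surrogate_pair` are illustrative choices for this example, not part of any standard API:

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # rule (1): 20-bit value
    high = 0xD800 + (offset >> 10)       # rule (2): high 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # rule (3): low 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10437)
print(f"{high:04X} {low:04X}")                  # D801 DC37
print(hex(from_surrogate_pair(high, low)))      # 0x10437
```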

U+D800 ~ U+DFFF

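The Unicode standard permanently reserves U+D800 ~ U+DFFF for the high and low surrogates of UTF-16. These code points will never be assigned characters and should not be encoded on their own. The standard states that no UTF form, including UTF-16, can encode them. UCS-2, UTF-8, and UTF-32 are technically able to represent these values, and a lot of software does so, but such sequences should be treated as encoding errors.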
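
As one illustration of how software can enforce this rule, Python 3's built-in codecs refuse to encode a lone surrogate unless the caller opts in explicitly. A short sketch (the helper `is_rejected` is a name invented for this example):

```python
lone_surrogate = "\ud800"  # a high surrogate with no low surrogate after it


def is_rejected(codec: str) -> bool:
    """Return True if the codec refuses to encode a lone surrogate."""
    try:
        lone_surrogate.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, is_rejected(codec))   # each line prints True

# The opt-in "surrogatepass" error handler produces the non-standard bytes
# that lenient software might emit; a strict decoder treats them as an error.
print(lone_surrogate.encode("utf-8", "surrogatepass").hex())  # eda080
```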

UTF-16 encoding example
Taking U+24B62 as an example, the conversion to UTF-16 takes three steps:

/*
(1) 0x24B62 - 0x10000 = 0x14B62 (binary 0001 0100 1011 0110 0010)
(2) High surrogate: take the high 10 bits (0x14B62 / 0x400 = 00 0101 0010 = 0x052)
    and add 0xD800
    0x052 + 0xD800 = 0xD852
(3) Low surrogate: take the low 10 bits (11 0110 0010 = 0x362) and add 0xDC00
    0x362 + 0xDC00 = 0xDF62
*/

Character	Binary code point	Binary UTF-16	Hex UTF-16 BE	Hex UTF-16 LE
U+10437	0001 0000 0100 0011 0111	1101 1000 0000 0001 1101 1100 0011 0111	D8 01 DC 37	01 D8 37 DC
U+24B62	0010 0100 1011 0110 0010	1101 1000 0101 0010 1101 1111 0110 0010	D8 52 DF 62	52 D8 62 DF
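
The BE and LE columns differ only in the byte order of each 16-bit code unit; the order of the two surrogates themselves does not change. A minimal sketch using the standard `struct` module, where `utf16_bytes` is a hypothetical helper written for this example:

```python
import struct


def utf16_bytes(code_units, big_endian: bool = True) -> bytes:
    """Serialize 16-bit code units as big-endian or little-endian bytes."""
    byte_order = ">" if big_endian else "<"
    return struct.pack(byte_order + "H" * len(code_units), *code_units)


units = (0xD852, 0xDF62)  # surrogate pair for U+24B62
print(utf16_bytes(units, big_endian=True).hex(" ").upper())    # D8 52 DF 62
print(utf16_bytes(units, big_endian=False).hex(" ").upper())   # 52 D8 62 DF

# Cross-check against Python's built-in codecs.
print(utf16_bytes(units, True) == "\U00024B62".encode("utf-16-be"))   # True
print(utf16_bytes(units, False) == "\U00024B62".encode("utf-16-le"))  # True
```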

UTF-32

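UTF-32 is a fixed-length encoding that can represent every Unicode code point. Each code point takes 4 bytes, which wastes a great deal of space, so it is rarely used.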
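
The space cost is easy to measure by encoding the same character in all three forms. This is a rough sketch, with `encoded_sizes` being a helper name chosen for this example; the codecs with an explicit byte order are used so that no byte order mark is counted:

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` for each encoding form."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))            # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("\u6797"))       # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("\U00024B62"))   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

For ASCII-heavy text UTF-32 is four times the size of UTF-8, and the overhead only disappears for supplementary-plane characters.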

References:
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/UTF-16
https://tools.ietf.org/html/rfc2781
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html