
Binary Ordered Compression for Unicode

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 217.184.142.58 (talk) at 06:04, 2 June 2008 (Trimmed lead: SCSU details belong in the SCSU article, not the BOCU lead section). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

BOCU-1 is a MIME-compatible Unicode compression scheme. BOCU stands for Binary Ordered Compression for Unicode. BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. The encoding is useful for compressing short strings, and it preserves code point order. For larger amounts of Unicode text, general-purpose compressors such as zip and bzip2 usually achieve better compression.

SCSU (the Standard Compression Scheme for Unicode) was created as a Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. It has not been widely adopted, although it fulfills the criteria for an IANA charset and is registered with IANA. SCSU is not suitable for MIME “text” media types [1]; for example, it cannot be used directly in email and similar protocols. SCSU also requires a complicated encoder design to perform well.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). If there is no such code point, encoding proceeds as though the previous code point were U+0000. The normalization mapping is as follows:

Code range                                       Normalized code point   Notes
U+3040 to U+309F                                 U+3070                  Hiragana
U+4E00 to U+9FA5                                 U+7711                  Unihan
U+AC00 to U+D7A3                                 U+C1D1                  Hangul
U+xxxx00 to U+xxxx7F (excluding ranges above)    U+xxxx40
U+xxxx80 to U+xxxxFF (excluding ranges above)    U+xxxxC0
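
The normalization rules above can be expressed compactly in code. The following Python sketch is illustrative only and is not taken from any reference implementation; the function names `normalize_prev` and `differences` are assumptions made for this example. It computes the differences that would subsequently be encoded as bytes, skipping code points up to U+0020, which are emitted directly as byte values.

```python
def normalize_prev(cp):
    """Return the normalized form of a previously encoded code point."""
    if 0x3040 <= cp <= 0x309F:   # Hiragana
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:   # Unihan
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:   # Hangul
        return 0xC1D1
    # All other code points: low bits 00..7F map to 40, 80..FF map to C0,
    # i.e. the middle of the enclosing block of 128 code points.
    return (cp & ~0x7F) + 0x40


def differences(text):
    """List the differences for every code point above U+0020 in text."""
    prev = 0x0000                # no previous code point yet
    result = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x20:
            result.append(cp - normalize_prev(prev))
        if cp != 0x20:           # an ASCII space does not update the state
            prev = cp
    return result
```

Under these assumptions, `differences("ab")` and `differences("a b")` both yield the differences 21 and 22 (hexadecimal), because the space leaves the previous code point unchanged and U+0061 normalizes to U+0040.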

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range       Byte sequence range (see below)
-10FF9F to -2DD0D      21 F0 58 D9 to 21 FF FF FF
-2DD0C to -2912        22 01 01 to 24 FF FF
-2911 to -41           25 01 to 4F FF
-40 to 3F              50 to CF
40 to 2910             D0 01 to FA FF
2911 to 2DD0B          FB 01 01 to FD FF FF
2DD0C to 10FFBF        FE 01 01 01 to FE 19 B4 54

Within each byte range, the sequences are in lexicographic order, and sequences containing any of the following byte values are skipped: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. This leaves 243 possible values for each byte after the first. For example, the byte sequence FC 06 FF, which encodes a difference of 1156B, is immediately followed by FC 10 01, which encodes a difference of 1156C.
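
As a rough illustration of how a difference maps to bytes, the Python sketch below treats each row of the table as a counter: the first byte sequence of the row is read as a number whose trailing digits are base 243, the offset of the difference within the row is added, and the result is written back out. The names `ROWS`, `TRAIL_BYTES` and `encode_difference` are assumptions made for this example, and the approach is one possible reading of the table rather than the method used by any particular implementation.

```python
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}
# The 243 byte values allowed after the first byte, in ascending order.
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]
TRAIL_INDEX = {b: i for i, b in enumerate(TRAIL_BYTES)}
BASE = len(TRAIL_BYTES)  # 243

# (lowest difference, highest difference, byte sequence for the lowest one),
# copied from the table above.
ROWS = [
    (-0x10FF9F, -0x2DD0D, bytes.fromhex("21F058D9")),
    (-0x2DD0C,  -0x2912,  bytes.fromhex("220101")),
    (-0x2911,   -0x41,    bytes.fromhex("2501")),
    (-0x40,      0x3F,    bytes.fromhex("50")),
    ( 0x40,      0x2910,  bytes.fromhex("D001")),
    ( 0x2911,    0x2DD0B, bytes.fromhex("FB0101")),
    ( 0x2DD0C,   0x10FFBF, bytes.fromhex("FE010101")),
]


def encode_difference(diff):
    """Return the byte sequence for one difference value."""
    for lo, hi, first in ROWS:
        if lo <= diff <= hi:
            break
    else:
        raise ValueError("difference out of range")
    # Read the row's first sequence as a number: the lead byte followed
    # by base-243 digits for the remaining bytes.
    n = first[0]
    for b in first[1:]:
        n = n * BASE + TRAIL_INDEX[b]
    n += diff - lo               # advance by the offset within the row
    # Write the number back out, least significant digit first.
    out = []
    for _ in first[1:]:
        n, digit = divmod(n, BASE)
        out.append(TRAIL_BYTES[digit])
    out.append(n)                # what remains is the lead byte
    return bytes(reversed(out))
```

With this sketch, `encode_difference(0x1156B).hex()` gives `'fc06ff'` and `encode_difference(0x1156C).hex()` gives `'fc1001'`, matching the example above. Using the first byte sequence of each row as the starting point avoids special handling for the first row, whose range begins at 21 F0 58 D9 rather than 21 01 01 01.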

References

  1. ^ "UTN #6: BOCU-1 Introduction". Retrieved 2008-05-18.

See also