Binary Ordered Compression for Unicode

BOCU-1 is a MIME-compatible Unicode compression scheme. BOCU stands for Binary Ordered Compression for Unicode. BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. The encoding is designed to be useful for compressing short strings, and it preserves code point order: sorting BOCU-1 byte strings in binary order gives the same result as sorting the text by code point. BOCU-1 is specified in a Unicode Technical Note.^[1]

For comparison, SCSU was adopted as the standard Unicode compression scheme, with a byte/code point ratio similar to that of language-specific code pages. SCSU has not seen wide use, because it is not suitable for MIME “text” media types; for example, it cannot be used directly in email and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently.^[2]

Both SCSU^[3] and BOCU-1^[4] are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:

Code range	Normalized code point	Notes
`U+3040` to `U+309F`	`U+3070`	Hiragana
`U+4E00` to `U+9FA5`	`U+7711`	Unihan
`U+AC00` to `U+D7A3`	`U+C1D1`	Hangul
`U+0020`	(encoder state kept as is)	Space
`U+xxxx00` to `U+xxxx7F` (excluding ranges above)	`U+xxxx40`	middle of 128
`U+xxxx80` to `U+xxxxFF` (excluding ranges above)	`U+xxxxC0`	middle of 128
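
The state handling described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `INITIAL_STATE`, `next_state`, and `differences` are made up for this example, and the sketch stops at the signed differences without turning them into bytes.

```python
INITIAL_STATE = 0x40  # encoder state before the first code point


def next_state(state: int, cp: int) -> int:
    """Return the encoder state after code point `cp` has been encoded."""
    if cp == 0x20:
        return state                # space: encoder state kept as is
    if 0x3040 <= cp <= 0x309F:
        return 0x3070               # Hiragana
    if 0x4E00 <= cp <= 0x9FA5:
        return 0x7711               # Unihan
    if 0xAC00 <= cp <= 0xD7A3:
        return 0xC1D1               # Hangul
    # Everything else: middle of the 128-code-point block containing cp.
    return (cp & ~0x7F) | 0x40


def differences(text: str):
    """List what an encoder would emit for `text`, before byte encoding.

    Each item is ("byte", value) for U+0000..U+0020, which are written as is,
    or ("diff", delta) for any other code point, where delta is the signed
    difference from the current state.
    """
    state = INITIAL_STATE
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            out.append(("byte", cp))
        else:
            out.append(("diff", cp - state))
        state = next_state(state, cp)
    return out
```

For instance, `differences("\u3042 \u3044")` (two Hiragana letters separated by a space) gives `[("diff", 0x3002), ("byte", 0x20), ("diff", -0x2C)]`: the first letter is far from the initial state, the space leaves the state at U+3070, and the second letter then yields a small difference.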

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range	Byte sequence range (see below)
`-10FF9F` to `-2DD0D`	`21` `F0` `58` `D9` to `21` `FF` `FF` `FF`
`-2DD0C` to `-2912`	`22` `01` `01` to `24` `FF` `FF`
`-2911` to `-41`	`25` `01` to `4F` `FF`
`-40` to `3F`	`50` to `CF`
`40` to `2910`	`D0` `01` to `FA` `FF`
`2911` to `2DD0B`	`FB` `01` `01` to `FD` `FF` `FF`
`2DD0C` to `10FFBF`	`FE` `01` `01` `01` to `FE` `19` `B4` `54`

Within each range, byte sequences are assigned in lexicographic order, skipping the following byte values: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
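
As a rough sketch of how the table and the excluded byte values fit together, the following Python function maps a signed difference to its byte sequence. The names `EXCLUDED`, `TRAIL`, `RANGES`, and `encode_diff` are illustrative only. The sketch assumes that the excluded values apply to trail bytes (lead bytes are all 21 or higher) and treats each range as counting upward from its lowest lead byte with all trail bytes at their minimum; the starting point of the lowest range is computed so that its top value lands on 21 FF FF FF.

```python
# Byte values that never appear in a multi-byte sequence.
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
TRAIL = [b for b in range(256) if b not in EXCLUDED]  # allowed trail bytes, ascending
N = len(TRAIL)  # 243

# (lowest difference of the range, first lead byte, number of trail bytes),
# listed from the highest range to the lowest.
RANGES = [
    (0x2DD0C, 0xFE, 3),
    (0x2911, 0xFB, 2),
    (0x40, 0xD0, 1),
    (-0x40, 0x50, 0),
    (-0x2911, 0x25, 1),
    (-0x2DD0C, 0x22, 2),
    (-0x2DD0C - N**3, 0x21, 3),  # nominal start; only the top part is used
]


def encode_diff(diff: int) -> bytes:
    """Encode one signed difference as a lead byte plus 0 to 3 trail bytes."""
    if not -0x10FF9F <= diff <= 0x10FFBF:
        raise ValueError("difference out of range")
    for low, lead, count in RANGES:
        if diff >= low:
            offset = diff - low
            trail = []
            # Peel off base-243 digits, least significant first.
            for _ in range(count):
                offset, digit = divmod(offset, N)
                trail.append(TRAIL[digit])
            # What is left of the offset selects the lead byte.
            return bytes([lead + offset] + trail[::-1])
    raise AssertionError("unreachable: the last range covers all valid input")
```

With these assumptions, `encode_diff(0x1156B).hex()` is `"fc06ff"` and `encode_diff(0x1156C).hex()` is `"fc1001"`, matching the example above. Because `TRAIL` is ascending and the ranges are laid out in increasing lead-byte order, a larger difference always produces a byte sequence that compares greater.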

Any ASCII input from U+0000 to U+007F, excluding space (U+0020), resets the encoder state to U+0040. Because the line-end code points U+000D and U+000A are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas in SCSU it can affect the entire document.

References

^[1] "UTN #6: BOCU-1". Retrieved 2008-05-18.
^[2] "UTN #14: A Survey of Unicode Compression". Retrieved 2008-06-02.
^[3] IANA registration record for SCSU.
^[4] IANA registration record for BOCU-1.
