UTF-1
UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash). It is also slow to encode and decode because of its use of division and multiplication by a number that is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
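The forward-slash problem can be illustrated with one row of the comparison table in this article. The following is a minimal sketch: the UTF-1 bytes for U+D7FF are copied from that table, and the variable names are illustrative only.

```python
# UTF-1 bytes for U+D7FF, copied from the comparison table in this article.
utf1_bytes = bytes.fromhex("F7 2F C3")
# UTF-8 bytes for the same code point, produced by Python's built-in codec.
utf8_bytes = "\ud7ff".encode("utf-8")

SLASH = ord("/")  # byte value 0x2F

# The UTF-1 sequence contains the slash byte although the text has no slash.
print(SLASH in utf1_bytes)

# UTF-8 never uses bytes below 0x80 inside a multi-byte sequence.
print(SLASH in utf8_bytes)
print(any(b < 0x80 for b in utf8_bytes))
```

A program that splits paths on the byte 0x2F would therefore cut this UTF-1 sequence in the middle of a character, while the UTF-8 sequence would pass through intact.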
Design
UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point is encoded in one, two, three, or five bytes. All code points from U+0000 to U+009F, which include the ASCII range, are encoded as a single byte.
UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: the bytes 0x00–0x20 and 0x7F–0x9F always stand for the corresponding code point. This design, with its 66 protected characters, was intended to be compatible with ISO 2022.
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
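The three radix values can be recomputed directly from the protected byte ranges given above. This is only an arithmetic check; the variable names are illustrative.

```python
# Bytes UTF-1 never uses inside a multi-byte sequence:
# C0 controls and space (0x00-0x20), DEL and C1 controls (0x7F-0x9F).
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
utf1_radix = 256 - len(utf1_protected)

# UTF-8 continuation bytes keep 6 payload bits (8 bits minus 2 marker bits).
utf8_radix = 2 ** (8 - 2)

# BOCU-1 protects 0x00, 0x07-0x0F, 0x1A-0x1B and 0x20.
bocu1_protected = {0x00, 0x20} | set(range(0x07, 0x10)) | set(range(0x1A, 0x1C))
bocu1_radix = 256 - len(bocu1_protected)

print(len(utf1_protected), utf1_radix)    # 66 190
print(utf8_radix)                         # 64
print(len(bocu1_protected), bocu1_radix)  # 13 243
```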
code point | UTF-8 | UTF-1 |
---|---|---|
U+007F | 7F | 7F |
U+0080 | C2 80 | 80 |
U+009F | C2 9F | 9F |
U+00A0 | C2 A0 | A0 A0 |
U+00BF | C2 BF | A0 BF |
U+00C0 | C3 80 | A0 C0 |
U+00FF | C3 BF | A0 FF |
U+0100 | C4 80 | A1 21 |
U+015D | C5 9D | A1 7E |
U+015E | C5 9E | A1 A0 |
U+01BD | C6 BD | A1 FF |
U+01BE | C6 BE | A2 21 |
U+07FF | DF BF | AA 72 |
U+0800 | E0 A0 80 | AA 73 |
U+0FFF | E0 BF BF | B5 48 |
U+1000 | E1 80 80 | B5 49 |
U+4015 | E4 80 95 | F5 FF |
U+4016 | E4 80 96 | F6 21 21 |
U+D7FF | ED 9F BF | F7 2F C3 |
U+E000 | EE 80 80 | F7 3A 79 |
U+F8FF | EF A3 BF | F7 5C 3C |
U+FDD0 | EF B7 90 | F7 62 BA |
U+FDEF | EF B7 AF | F7 62 D9 |
U+FEFF | EF BB BF | F7 64 4C |
U+FFFD | EF BF BD | F7 65 AD |
U+FFFE | EF BF BE | F7 65 AE |
U+FFFF | EF BF BF | F7 65 AF |
U+10000 | F0 90 80 80 | F7 65 B0 |
U+38E2D | F0 B8 B8 AD | FB FF FF |
U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 |
U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A |
U+100000 | F4 80 80 80 | FC 21 37 B2 7B |
U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C |
U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 |
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
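The UTF-1 column of the table can be reproduced with a short encoder. The following is a sketch, not a reference implementation: the range boundaries (0xA0, 0x100, 0x4016, 0x38E2E) are read off the table rows where the sequence length changes, the digit-to-byte mapping is the one implied by the 190 unprotected byte values, and the names `utf1_encode` and `_trail` are hypothetical.

```python
def _trail(digit: int) -> int:
    """Map a base-190 digit (0..189) to an unprotected byte value."""
    if not 0 <= digit < 190:
        raise ValueError("digit out of range")
    if digit < 0x5E:
        return digit + 0x21  # 0x21..0x7E
    return digit + 0x42      # 0xA0..0xFF


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 bytes."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:            # one byte: the code point itself
        return bytes([cp])
    if cp < 0x100:           # two bytes: 0xA0 followed by the code point
        return bytes([0xA0, cp])
    if cp < 0x4016:          # two bytes: lead 0xA1..0xF5 plus one digit
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _trail(y % 190)])
    if cp < 0x38E2E:         # three bytes: lead 0xF6..0xFB plus two digits
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _trail(y // 190 % 190),
                      _trail(y % 190)])
    y = cp - 0x38E2E         # five bytes: lead 0xFC..0xFD plus four digits
    return bytes([0xFC + y // 190**4,
                  _trail(y // 190**3 % 190),
                  _trail(y // 190**2 % 190),
                  _trail(y // 190 % 190),
                  _trail(y % 190)])


# Spot checks against rows of the table.
print(utf1_encode(0x07FF).hex(" "))      # aa 72
print(utf1_encode(0xFEFF).hex(" "))      # f7 64 4c
print(utf1_encode(0x10FFFF).hex(" "))    # fc 21 39 6e 6c
print(utf1_encode(0x7FFFFFFF).hex(" "))  # fd bd 2b b9 40
```

The repeated division and remainder by 190 is the non-power-of-2 arithmetic that makes UTF-1 slower than the bit shifts and masks used for UTF-8.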
References
- ISO/IEC JTC 1/SC2/WG2 (1993-01-21). "ISO IR 178: UCS Transformation Format One (UTF-1)" (PDF, 256 KB) (1 ed.). Registration number 178. Archived from the original on 2015-03-18.
- Czyborra, Roman (1998-11-30). "Unicode Transformation Formats: UTF-8 & Co". Archived from the original on 2016-06-07. Retrieved 2016-06-07.