Variable-width encoding

Variable-width encoding is a character encoding scheme in which units of differing lengths are used to encode a coded character set (a repertoire with numbers assigned to it) in computer memory or storage. It is also known as a multibyte encoding, though this is a less accurate term, since not all variable-width encodings use 8-bit units (UTF-16, for example, is a variable-width encoding that uses 16-bit units).

Variable-width encodings arise from the need to extend an encoding's range beyond an existing limit without breaking backward compatibility with a legacy constraint. For example, with 8 bits per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to increase the number of bits per character, such as to 16 bits for 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all. The first variable-width encodings, the ISO 2022 encodings for Chinese, Japanese and Korean, were even further constrained to the limit of 7 bits per character.

General structure

A variable-width encoding adds a layer of sequences of 1+x units (where x>0) to encode characters outside the range that a single unit can cover. The single-unit layer coexists with the multiunit additions. The result is that there are three sorts of units in a variable-width encoding: singletons, which consist of a single unit; lead units, which come first in a multiunit sequence; and trail units, which come afterwards in a multiunit sequence. For example, the word can’t (with a right single quotation mark for the apostrophe, not the ASCII apostrophe) is encoded in UTF-8 as: 63 61 6E E2 80 99 74. In this sequence, 63, 61, 6E and 74 are singletons, E2 is a lead unit and 80 and 99 are trail units.
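The classification of the units in can’t can be checked with a short sketch. The function name classify_utf8_unit is hypothetical, and the byte ranges it uses are those of current UTF-8 (trail units in 80-BF, lead units in C2-F4), which is an assumption beyond what this paragraph states.

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify one UTF-8 byte as singleton, lead or trail (illustrative helper)."""
    if byte <= 0x7F:
        return "singleton"
    if 0x80 <= byte <= 0xBF:
        return "trail"
    if 0xC2 <= byte <= 0xF4:
        return "lead"
    return "invalid"  # C0, C1 and F5-FF never appear in well-formed UTF-8


# U+2019 is the right single quotation mark used in the example word.
encoded = "can\u2019t".encode("utf-8")
hex_units = " ".join(f"{b:02X}" for b in encoded)
kinds = [classify_utf8_unit(b) for b in encoded]

for b, kind in zip(encoded, kinds):
    print(f"{b:02X} {kind}")
```

Each byte can be classified on its own, without looking at its neighbours, because the three ranges do not overlap.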

UTF-8 is one of the best-designed variable-width encodings, so the three sorts of units are kept apart and are easy to identify. Other variable-width encodings may not be so well designed, and in them the trail and lead units overlap (the same values serve as both). Some are so badly designed that all three overlap. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from a point where the unit boundaries are known for certain, usually the beginning, in order to identify the various units properly and render the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1. There is also the danger that a single corrupted or lost unit may change the interpretation of a long run of multiunit sequences entirely. In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.
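The false positive described above can be reproduced with a toy encoding. Everything in this sketch is an assumption made for illustration: bytes 00-7F are singletons, any byte in 80-FF can be a lead or a trail unit, and every multiunit sequence is exactly two units long. The helper names are hypothetical.

```python
def split_units(data: bytes) -> list:
    """Scan from the start and return (offset, sequence) for each character."""
    out = []
    i = 0
    while i < len(data):
        width = 1 if data[i] <= 0x7F else 2  # toy rule: high byte starts a pair
        out.append((i, data[i:i + width]))
        i += width
    return out


def find_on_boundaries(data: bytes, needle: bytes) -> int:
    """Like bytes.find, but only accept matches aligned to character boundaries."""
    boundaries = {offset for offset, _ in split_units(data)} | {len(data)}
    pos = data.find(needle)
    while pos != -1:
        if pos in boundaries and pos + len(needle) in boundaries:
            return pos
        pos = data.find(needle, pos + 1)
    return -1


text = bytes([0xDE, 0xDF, 0xE0, 0xE1])    # two characters: DE DF and E0 E1
needle = bytes([0xDF, 0xE0])              # one character: DF E0

naive = text.find(needle)                 # matches across the character boundary
aware = find_on_boundaries(text, needle)  # rejects the misaligned match
print(naive, aware)
```

The boundary-aware search has to decode from the start of the text, which is exactly the cost that an encoding with disjoint unit ranges avoids.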

CJK variable-width encodings

The first use of variable-width encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and three-byte sequences were added later. The ISO 2022 encoding schemes for CJK are still in use on the Internet.

On Unix platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80-FF, while the singletons were in the range 00-7F alone. The lead units and trail units were in the range A1 to FE, that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1.
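The relationship between the two schemes can be observed with the codecs that ship with Python. This is a sketch that assumes the standard library's iso2022_jp and euc_jp codecs and a sample string whose characters are both in JIS X 0208; the escape sequences named in the constants are the ones that codec emits for such text.

```python
text = "\u65e5\u672c"  # the two kanji of "Nihon" (Japan), both in JIS X 0208

iso = text.encode("iso2022_jp")  # 7-bit, stateful: escape sequences switch modes
euc = text.encode("euc_jp")      # 8-bit, stateless: high bit marks multibyte units

ESC_TO_JISX0208 = b"\x1b$B"  # switch to two-byte mode
ESC_TO_ASCII = b"\x1b(B"     # switch back to single-byte mode

# Strip the escape sequences to get the raw two-byte units in the range 21-7E.
payload = iso[len(ESC_TO_JISX0208):-len(ESC_TO_ASCII)]

# Setting the most significant bit of each unit should yield the EUC-JP bytes.
high_bit_set = bytes(b | 0x80 for b in payload)

print(iso, euc, high_bit_set == euc)
```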

On the PC (MS-DOS and Microsoft Windows platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: Shift-JIS and Big5 respectively. In Shift-JIS, lead units had the range 81-9F and E0-FC, trail units had the range 40-7E and 80-FC, and singletons had the range 21-7E and A1-DF. In Big5, lead units had the range A1-FE, trail units had the range 40-7E and A1-FE, and singletons had the range 21-7E.
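The overlap in Shift-JIS can be made concrete with the ranges given above. The sketch below assumes Python's built-in shift_jis codec and uses the katakana letter SO (U+30BD) as a sample; the helper possible_roles is hypothetical.

```python
# Ranges as stated in the text (inclusive bounds).
SINGLETON_RANGES = [(0x21, 0x7E), (0xA1, 0xDF)]
LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]
TRAIL_RANGES = [(0x40, 0x7E), (0x80, 0xFC)]


def in_ranges(byte: int, ranges: list) -> bool:
    return any(lo <= byte <= hi for lo, hi in ranges)


def possible_roles(byte: int) -> list:
    """List every role a byte value could play when seen out of context."""
    roles = []
    if in_ranges(byte, SINGLETON_RANGES):
        roles.append("singleton")
    if in_ranges(byte, LEAD_RANGES):
        roles.append("lead")
    if in_ranges(byte, TRAIL_RANGES):
        roles.append("trail")
    return roles


encoded = "\u30bd".encode("shift_jis")  # katakana SO, a two-byte character
lead, trail = encoded

# The trail byte has the same value as the single-byte backslash (5C), so a
# naive byte search for a backslash matches inside this character.
false_hit = encoded.find(b"\x5c")

print(f"{lead:02X} {trail:02X}", possible_roles(lead), possible_roles(trail), false_hit)
```

This is why software that treats Shift-JIS text as plain bytes can misread the second half of a character as a path separator or an escape character.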

Unicode variable-width encodings

The Unicode standard has two variable-width encodings: UTF-8 and UTF-16. Originally, both the Unicode and ISO 10646 standards were meant to be fixed-width. ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00-9F, lead units the range A0-FF and trail units the ranges A0-FF and 21-7E. Because of this bad design, parallel to Shift-JIS and Big5 in its overlap of values, the inventors of the Plan 9 operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to stay in synchronism with the encoding capacity of UTF-16; see UTF-8 article), and trail units have the range 80-BF. The lead unit also tells how many trail units follow: one for C2-DF, two for E0-EF and three for F0-F4.
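The way a UTF-8 lead unit announces the length of its sequence can be sketched as follows. The function name utf8_trail_count is hypothetical; the ranges are the current ones (lead units C2-F4, trail units 80-BF), and the sample characters are arbitrary choices.

```python
def utf8_trail_count(lead: int) -> int:
    """Return how many trail units follow a given UTF-8 lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:02X} is not a UTF-8 lead unit")


# One sample per sequence length: U+00E9, U+2019 and U+1F600.
samples = ["\u00e9", "\u2019", "\U0001F600"]
results = []
for ch in samples:
    encoded = ch.encode("utf-8")
    expected_trails = utf8_trail_count(encoded[0])
    trails_ok = all(0x80 <= b <= 0xBF for b in encoded[1:])
    results.append((len(encoded) - 1, expected_trails, trails_ok))
    print(encoded.hex(), expected_trails, trails_ok)
```

A decoder can therefore find the end of a character by reading only its first unit, and can resynchronise after damage by skipping forward to the next byte outside 80-BF.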

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 numbers, making for a maximum of 1,114,112 possible code points in Unicode (1,048,576 plus the original 65,536).
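The surrogate arithmetic can be worked through in a few lines. The function name to_surrogate_pair is hypothetical and U+1F600 is an arbitrary sample code point; the result is compared against Python's own utf-16-be codec as a cross-check.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above FFFF into a (lead, trail) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000       # 20-bit value, 0 to 0xFFFFF
    lead = 0xD800 + (offset >> 10)      # top 10 bits select one of 1024 leads
    trail = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select one of 1024 trails
    return lead, trail


lead, trail = to_surrogate_pair(0x1F600)
codec_bytes = "\U0001F600".encode("utf-16-be")

# 1024 leads x 1024 trails, plus the 65,536 values of the original 16-bit space.
total_code_points = 1024 * 1024 + 0x10000

print(f"{lead:04X} {trail:04X}", codec_bytes.hex(), total_code_points)
```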

UTF-32, in contrast to the other two, is fixed-width.