Specials (Unicode block)
Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 codepoints, 5 are assigned as of Unicode 5.0:
- U+FFF9 "INTERLINEAR ANNOTATION ANCHOR", marks start of annotated text
- U+FFFA "INTERLINEAR ANNOTATION SEPARATOR", marks start of annotating text
- U+FFFB "INTERLINEAR ANNOTATION TERMINATOR", marks end of annotating text
- U+FFFC "OBJECT REPLACEMENT CHARACTER", placeholder in the text for another unspecified object, for example in a compound document.
- U+FFFD "REPLACEMENT CHARACTER", used to replace an unknown or unprintable character
U+FFFE and U+FFFF are not unassigned in the usual sense; they are guaranteed never to be Unicode characters at all. They can be used to guess a text's encoding scheme, since any text containing them is by definition not correctly encoded Unicode text. U+FEFF is Unicode's byte-order mark; its character name is "zero-width no-break space", since its inclusion in text should go unnoticed. If this character is read in the wrong byte order (for example, because of an endianness bug), it reads as 0xFFFE, which is not a valid Unicode character.
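The byte-order check described above can be sketched in Python. This is a minimal illustration rather than a complete encoding detector; the function name detect_utf16_byte_order is hypothetical and only the first two bytes are inspected.

```python
BOM = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte-order mark


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes (hypothetical helper)."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "unknown"


big = BOM.encode("utf-16-be")     # bytes FE FF
little = BOM.encode("utf-16-le")  # bytes FF FE

# Interpreting the big-endian bytes with the wrong byte order
# yields the value 0xFFFE, which is never a Unicode character.
misread = int.from_bytes(big, "little")

print(hex(misread))
print(detect_utf16_byte_order(little + "a".encode("utf-16-le")))
```

Because 0xFFFE can never be a legitimate character, a reader that sees it where a byte-order mark is expected can conclude that the byte order is swapped.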
Replacement character
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. It is used to indicate a problem when a system such as a text parser is unable to decode a stream of data to a correct symbol.
Consider a text file containing the German word für in the Windows-1252 encoding. The file is then opened in a text editor that has UTF-8 as its preset encoding. The first byte (0x66) is within the code range 0x000000–0x00007F, so UTF-8 correctly interprets it as an f. The second byte (0xFC) is binary 1111 1100, a value that cannot occur in valid UTF-8 encoded data. A text editor could therefore insert the replacement character to warn the user that something went wrong. The last byte (0x72) is again within the code range 0x000000–0x00007F and can be decoded correctly. The whole string now looks like this: f�r.
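The same scenario can be reproduced with Python's built-in codecs, as a rough sketch. It assumes the "replace" error handler stands in for the text editor's behaviour; an actual editor may handle invalid bytes differently.

```python
# The word "für" stored as Windows-1252 bytes: 0x66 0xFC 0x72
raw = "für".encode("cp1252")

# Strict UTF-8 decoding rejects the byte 0xFC outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", each undecodable byte becomes U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(decoded)
```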
If this file is now saved as UTF-8, its data will look like this: 0x66 0xEF 0xBF 0xBD 0x72, which will be displayed in Windows-1252 as fï¿½r (see mojibake).
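The byte sequence and its Windows-1252 rendering can be checked with a short Python sketch. The variable names are illustrative only.

```python
# The string as the editor now holds it: f, U+FFFD, r
damaged = "f\ufffdr"

# Saving as UTF-8 turns U+FFFD into the three bytes EF BF BD.
saved = damaged.encode("utf-8")
saved_hex = " ".join(f"0x{b:02X}" for b in saved)

# Reopening those bytes as Windows-1252 shows one character per byte.
reopened = saved.decode("cp1252")

print(saved_hex)
print(reopened)
```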
Once data has been transformed as in the example above (different symbols replaced with a single replacement character), there is no trivial way to get the original data back other than manually identifying the correct character from context and restoring it.
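A small Python sketch illustrates why the transformation cannot be undone automatically: several different Windows-1252 inputs collapse to the same output. The sample strings are arbitrary and chosen only for illustration.

```python
# Three different strings, each with a different non-ASCII middle character.
originals = ["für", "fär", "för"]

# Encode each as Windows-1252, then decode wrongly as UTF-8 with replacement.
damaged = {
    text: text.encode("cp1252").decode("utf-8", errors="replace")
    for text in originals
}

# All three inputs end up as the same string, so the original
# character can no longer be determined from the data alone.
distinct_results = set(damaged.values())

print(damaged)
print(len(distinct_results))
```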
Some websites incorrectly declare their encoding as UTF-8 when the content is actually encoded in, for example, Windows-1252. In some web browsers (such as Firefox), this results in all umlauts, ß, and some other characters in the higher range of Windows-1252 (those with the most significant bit set to 1) being displayed as � instead. Other web browsers, such as newer versions of Internet Explorer, try to work out which code page was meant to be used.
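One simple way to guess the intended code page is to try the declared encoding first and fall back to another when decoding fails. The sketch below is only an illustration of that idea and is not how any particular browser works; the function decode_with_fallback and its default arguments are assumptions.

```python
def decode_with_fallback(data: bytes,
                         declared: str = "utf-8",
                         fallback: str = "cp1252") -> str:
    """Decode with the declared encoding, falling back if the bytes are invalid.

    Hypothetical helper: real browsers use more elaborate detection.
    """
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        # Bytes undefined in the fallback code page still become U+FFFD.
        return data.decode(fallback, errors="replace")


# A page declared as UTF-8 but actually encoded in Windows-1252.
mislabeled = "Grüße".encode("cp1252")
# A page that really is UTF-8.
correct = "Grüße".encode("utf-8")

print(decode_with_fallback(mislabeled))
print(decode_with_fallback(correct))
```

This heuristic works here because the Windows-1252 bytes for ü and ß do not form valid UTF-8. It can still guess wrong when the bytes happen to be valid in the declared encoding.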