Unicode
Unicode is a standard for encoding computer text in most of the internationally used writing systems into bytes. It is maintained by the Unicode Consortium and kept in step with the ISO/IEC 10646 standard. Its goal is to replace current and previous character encoding standards with one worldwide standard for all languages, and it has already done so to a large degree. For example, it is dominant on the web (in the form of the UTF-8 encoding). New versions are issued regularly, and recent versions have over 144,000 characters, covering 159 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes.
Unicode was developed in the 1990s and integrated earlier codes used on computer systems.
Unicode provides many printable characters, such as letters, digits, diacritics (things that attach to letters), and punctuation marks. It also provides characters which do not actually print, but instead control how text is processed. For example, a newline and a character that makes text go from right to left are both characters that do not print.
Unicode treats a graphical character (for instance é) as a code point, either alone or in a sequence (e followed by a combining acute accent ´). Each code point is a number, usually written in hexadecimal (for example U+00E9 for é), and it can be encoded in one or several code units. Code units are 8, 16, or 32 bits long. This allows Unicode to represent characters in binary.
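The difference between a single code point and a sequence can be sketched in Python. This is only an illustration: the helper names `code_points` and `code_unit_count` and the table `CODE_UNIT_BYTES` are made up for this example and are not part of Unicode or of Python.

```python
import unicodedata

# Two ways to write the same visible character "é".
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Size in bytes of one code unit in each encoding form (hypothetical helper table).
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_points(text):
    """Return the code points of text in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def code_unit_count(text, encoding):
    """Count how many code units text needs in the given encoding."""
    return len(text.encode(encoding)) // CODE_UNIT_BYTES[encoding]


if __name__ == "__main__":
    for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
        print(label, code_points(text))
        for encoding in CODE_UNIT_BYTES:
            print("  ", encoding, code_unit_count(text, encoding), "code unit(s)")

    # Normalization (form NFC) turns the sequence into the single code point.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings look the same on screen, but they are different sequences of code points until they are normalized.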
The Unicode Standard includes more than just the base code. Alongside the character encodings, the Consortium's official publication includes a wide variety of details about the scripts and how to display them: normalization rules, decomposition, collation, rendering, and bidirectional text display order for multilingual texts, and so on.
Unicode can be implemented by different character encodings. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings. The most commonly used encodings are UTF-8 and UTF-16. GB18030, while not an official Unicode standard, is standardized in China and implements Unicode fully.
Some letters, such as the Devanagari kshi, and national flag emojis are represented with more than one code point.
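A short Python sketch can list the code points behind such characters. The flag of Japan is used here only as an example, and the helper name `describe` is made up for this illustration.

```python
import unicodedata

# A national flag emoji is a pair of "regional indicator" code points.
# Here the pair spells the country code JP (Japan).
flag_japan = "\U0001F1EF\U0001F1F5"

# The Devanagari syllable kshi: KA + VIRAMA + SSA + VOWEL SIGN I.
kshi = "\u0915\u094D\u0937\u093F"


def describe(text):
    """Return (code point, official name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for label, text in (("flag", flag_japan), ("kshi", kshi)):
        print(label, "uses", len(text), "code points")
        for code_point, name in describe(text):
            print("  ", code_point, name)
```

Python's `len` counts code points, so it reports 2 for the flag and 4 for kshi, even though each is shown as one character on screen.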
Encodings
There are different ways to encode Unicode. The most common ones are:
- UTF-8, which uses one to four bytes for each code point and maximizes compatibility with ASCII
- UTF-16, which uses one or two 16-bit code units per code point and cannot encode surrogates
UTF-8 is the most common of these for exchanging text. It is used on the internet and in electronic mail, and Java also uses a variant of it.
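The sizes listed above can be checked with a small Python sketch. The sample characters and the helper names `utf8_length`, `utf16_units` and `can_encode` are chosen only for this illustration.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = {
    "A": "\u0041",              # LATIN CAPITAL LETTER A
    "e-acute": "\u00e9",        # LATIN SMALL LETTER E WITH ACUTE
    "euro": "\u20ac",           # EURO SIGN
    "grinning": "\U0001F600",   # GRINNING FACE emoji
}


def utf8_length(ch):
    """Number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    """Number of 16-bit code units the character takes in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def can_encode(text, encoding):
    """Return True if Python can encode text in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name, ch in samples.items():
        print(name, utf8_length(ch), "byte(s) in UTF-8,",
              utf16_units(ch), "unit(s) in UTF-16")

    # ASCII text has the same bytes in ASCII and in UTF-8.
    print("Hello".encode("ascii") == "Hello".encode("utf-8"))

    # A lone surrogate (U+D800) is not a character, so Python refuses to encode it.
    print(can_encode("\ud800", "utf-16"), can_encode("\ud800", "utf-8"))
```

The emoji needs two code units in UTF-16 because its code point is above U+FFFF, so it is stored as a surrogate pair.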
Problems
