
Unicode and HTML

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 63.192.137.xxx (talk) at 23:43, 17 October 2001 (rewrote a couple of sentences). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

HTML 4.0 uses Unicode as its document character set. In practice, though, documents are usually stored in an 8-bit character encoding that can represent only a small slice of that set. Any Unicode character can still appear in an HTML document through a numeric character reference: &#N;, where N is the code point in decimal, or &#xN;, where N is the code point in hexadecimal. (The hexadecimal form is more recent, and therefore less widely supported, than the decimal form.) There is also a standard set of named character entity references for commonly used symbols that some character encodings lack, so one can write &mdash;, for example, to represent an em dash (—) in text even if the document's character encoding does not contain that character.
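As a rough illustration of the three notations described above, the following Python sketch rewrites every non-ASCII character as a decimal or hexadecimal numeric character reference and checks that both forms, as well as the named form &mdash;, decode back to the same text. The function name to_numeric_refs and the sample string are made up for this example; html.unescape is the standard-library decoder.

```python
import html


def to_numeric_refs(text, hexadecimal=False):
    """Replace each non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration: ASCII characters pass through
    unchanged, everything else becomes &#N; (decimal) or &#xN; (hexadecimal).
    """
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            parts.append(ch)
        elif hexadecimal:
            parts.append(f"&#x{code_point:X};")
        else:
            parts.append(f"&#{code_point};")
    return "".join(parts)


sample = "an em dash \u2014 like this"
decimal_form = to_numeric_refs(sample)                # uses &#8212;
hex_form = to_numeric_refs(sample, hexadecimal=True)  # uses &#x2014;
named_form = sample.replace("\u2014", "&mdash;")      # named entity reference

# All three spellings decode to the same character.
for encoded in (decimal_form, hex_form, named_form):
    assert html.unescape(encoded) == sample
```

For the decimal form alone, Python's built-in sample.encode("ascii", "xmlcharrefreplace") produces the same references without a helper function.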


Many browsers, though, are only capable of displaying a small subset of the full UCS-2 repertoire. For example, the codes &#916; &#1049; &#1511; &#1605; &#3671; &#12353; &#21494; &#33865; &#45307; display on your browser as Δ, Й, ק, م, ๗, ぁ, 叶, 葉, and 냻, which ideally look like the Greek letter "Delta", the Cyrillic letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana small "A", simplified Chinese "Leaf", traditional Chinese "Leaf", and a Korean syllable, respectively.

Some multilingual web browsers merge the required font sets dynamically on demand. Microsoft's Internet Explorer 5.5 on Windows, for example, can display all the Unicode characters on this page simultaneously once the appropriate "text display support packs" are downloaded, and its "install on demand" feature prompts the user whenever a new font is needed. Other browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page. If you are using the latter type of browser, it is unlikely that your computer has all of those fonts, nor can the browser use all available fonts on the same page. As a result, the browser will not display all of the text above correctly, though it may display a subset of it. Because the characters are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, characters that have names for use in named entity references are likely to be more widely available than others.
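To see which characters the sample codes stand for without relying on a browser's fonts, the references can be decoded programmatically. The sketch below assumes the decimal code points of the nine sample characters in the paragraph above (916, 1049, 1511, 1605, 3671, 12353, 21494, 33865, 45307); the describe helper is hypothetical, while html.unescape and unicodedata.name are standard-library functions.

```python
import html
import unicodedata

# Decimal code points of the nine sample characters discussed above.
SAMPLE_CODES = [916, 1049, 1511, 1605, 3671, 12353, 21494, 33865, 45307]


def describe(code):
    """Decode one decimal reference and report the character it names.

    Returns (character, U+XXXX label, Unicode character name).
    """
    ch = html.unescape(f"&#{code};")
    return ch, f"U+{code:04X}", unicodedata.name(ch, "<unnamed>")


for code in SAMPLE_CODES:
    ch, label, name = describe(code)
    # Whether ch is drawn correctly still depends on the fonts available
    # to the terminal, just as it does for a browser.
    print(f"&#{code};  {label}  {name}  {ch}")
```

The character names come from the Unicode database bundled with the Python interpreter, so they are available even when no installed font can draw the character itself.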


/TableCodesFrom128To999


/TableCodesFrom1000To1999


Note: The tables show decimal codes; hexadecimal codes should be shown as well, because they are what the printed version of the Unicode-Manual uses.

Additional Note: How should these tables be labeled and where should they be put? The division should probably be along the blocks and not just blocks of 1000s.


I think the page names should use 4-digit hexadecimal (and 8-digit beyond the Basic Multilingual Plane, if you want to go there). The division would make more sense if it took blocks into account. Also, you need to omit things like 0080-009F (control characters) and D800-DFFF (surrogates).
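One way the suggestions above might be combined is sketched below: labels use 4-digit hexadecimal inside the Basic Multilingual Plane and 8-digit hexadecimal beyond it, each table row carries both the hexadecimal label and the decimal reference, and the two excluded ranges are skipped. The function names and the row layout are assumptions made for this illustration, not an agreed convention for these pages.

```python
# Ranges the comment above proposes to leave out of the tables.
C1_CONTROLS = range(0x0080, 0x00A0)  # 0080-009F, control characters
SURROGATES = range(0xD800, 0xE000)   # D800-DFFF, surrogates


def page_label(code_point):
    """Format a code point as 4 hex digits in the BMP, 8 digits beyond it."""
    width = 4 if code_point <= 0xFFFF else 8
    return f"{code_point:0{width}X}"


def is_tabulated(code_point):
    """Return False for code points that should be omitted from the tables."""
    return code_point not in C1_CONTROLS and code_point not in SURROGATES


def table_rows(first, last):
    """Build (hex label, decimal code, decimal reference) rows for a range."""
    return [
        (page_label(cp), cp, f"&#{cp};")
        for cp in range(first, last + 1)
        if is_tabulated(cp)
    ]


# Example: the rows on either side of the end of the control-character range.
for row in table_rows(0x009E, 0x00A1):
    print(row)
```

Dividing the tables along block boundaries would then amount to calling table_rows once per block with that block's first and last code points.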


/Latin1_Supplement

/Latin_Extended_A

/Latin_Extended_B

/IPA_Extensions

/Spacing_Modifier_Letters

/Combining_Diacritical_Marks

/Cyrillic

/Hebrew

/Arabic


See also:

Wiki_special_characters