Jump to content

Talk:Unicode and HTML

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Mzajac (talk | contribs) at 07:22, 31 January 2005 ("IE5 was the first to use glyphs from 'best available' fonts"). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

For the Unicode character charts, reverted the URL from http://www.unicode.org/charts/normalization/ back to http://www.unicode.org/charts/. The normalization charts only display the characters if you have the font already installed and do not seem to be as complete as the full charts available on the other URL. --Nate 15:37 Mar 7, 2003 (UTC)


gg

Hmm, I wasn't sure whether it was really more "intuitive" to use decimal instead of hexadecimal. I mean, both alternatives obscure the character, and the argument that "older web browsers" fail to parse hexadecimal is moot IMO since those web browsers will have a problem with non-8-bit characters anyway. I felt NPOV would dictate to say that in HTML, both hexadecimal and decimal can be used, so I changed that. Djmutex 10:51 May 2, 2003 (UTC)


The intro says: HTML 4.0 uses Unicode as its official character set.

Does somebody have a link to the place in the specification where this is stated? -- Hirzel

Section 5.1. The use of the term "character set" is misleading because it is so overloaded, but it is accurate. An HTML document must consist of Unicode characters. Those characters are, in turn, encoded (as iso-8859-1, utf-8, etc.). Today I added some text to the article to clarify this point. - mjb 23:48, 19 Aug 2004 (UTC)
I believe that statement is wrong. If I understand http://www.w3.org/TR/html401/charset.html#encodings correctly, it explicitly says that there is no "default" encoding. Instead, "conforming user agents" must be able to map any HTML document to Unicode (for example, to support all defined HTML named entities), and may apply heuristics on the HTML document if no charset is specified explicitly (either in the HTTP header, or a META tag, or a "charset" attribute on a specific element.) -- djmutex 21:20 10 Jun 2003 (UTC)
Re myself, the statement isn't really wrong, but it might be misleading. All Unicode characters must be supported by a browser, but there is no "default" character set. As a result, Unicode is the "official" character set, but it's not the default. djmutex 21:23 10 Jun 2003 (UTC)
Well, sorta, but not really. There is confusion resulting from the unfortunate use of the overloaded term "character set" and your apparent misunderstanding that Unicode is itself an encoding in the same sense as a "charset" (it's not).
HTML documents are indeed comprised of Unicode characters, always, but Unicode characters are abstract concepts: just "the idea of" a unit in a writing system, mapped to a similarly abstract concept of a non-negative integer number, its "code point". An HTML or XML document is defined as being a sequence of such characters, and therefore is itself an abstract entity. It is only when the document manifests as a sequence of bits/bytes on the network, on disk, or in memory that it has an encoding associated with it. The encoding maps the characters/code points to sequences of bits. You're right that there is no default encoding, at least in the HTML spec itself, but depending on how the document is transmitted, there may be a default of us-ascii or iso-8859-1 (RFCs 2616 and 3023 address this topic). I've modified the article somewhat to explain this without going into too much detail; there are articles devoted to these topics and we don't need to repeat them in depth here. -- mjb 23:48, 19 Aug 2004 (UTC)

Unicode display in Windows MSIE

Some multilingual web browsers that dynamically merge the required font sets on demand, e.g., Microsoft's Internet Explorer 5.0 and up on Windows, or Mozilla/Netscape 6 and up cross-platform, are capable of displaying all the Unicode characters on this page simultaneously after the appropriate "text display support packs" are downloaded. MSIE 5.5 would prompt the users if a new font were needed via its "install on demand" feature.

All of the characters in the table display correctly on my Mac's Safari and Firefox (thanks partly to Code2000 and Code2001 fonts). But my stock Windows XP installation doesn't show the last six letters in MSIE 6.0 or Firefox 1.0, and doesn't prompt me to do anything. Is the above passage incorrect, or is there something wrong with my Windows or Explorer? Michael Z. 00:35, 2005 Jan 20 (UTC)

What is a "text display support pack". That phrase doesn't appear on the Internet, except for this page. Michael Z. 14:20, 2005 Jan 20 (UTC)

The sentence in the article only states, that the browsers are able to switch between fonts, if these are installed. So your stock XP system doesn't have enough or the right fonts, I assume.
Also note, that the method the Mozilla switches, is more flexible. It will switch for a single missing diacrical character to another font if necessary. Ugly but better than nothing. See the Nirvana article for examples.
About the exact meaning of "" I'm wondering myself.
Pjacobi 21:47, 2005 Jan 20 (UTC)
I'm still confused about that passage. I've been editing pages with Old Cyrillic and IPA characters on them. Windows users complain that they can't see some of the characters unless we put them in a <span> with Arial Unicode MS as the first font choice. The characters are supported by a font present in Windows, but I see no "dynamic merging on demand", and no "install on demand" prompting. I would rewrite the description, but I don't know much about Windows and maybe the original author knows something that I don't know.
In the Nirvana article MSIE/Win shows the a-macrons. Firefox/Win also shows the n-dots and m-dots, but the font seems to match up with the rest of the page just fine. Both Mac browsers show all of that, plus the Chinese. But on the Mac, the n-dot in isn't bold-faced where it should be—like the Moz method you describe.
Michael Z. 00:45, 2005 Jan 21 (UTC)


AFAIK the "dynamic font switching" in MSIE is only a lookup up code ranges and languages to fonst. Fonts which aren't in these lookup tables are never considered for display. Now, if a poor piece of text is mapped to font X by these MSIE tables, all codepoints not covered in X will just not display! So, for a bad solution of this problem, MSIE users want an explicit font tag.
In contrast Mozilla switches based on Codepoint availability in the font. I have a rather plain default font set, and the b-dots an dm-dots in Nirvana are displayed in the also installed Code 2000 by Moz.
Pjacobi 08:50, 2005 Jan 21 (UTC)
So MSIE/Win just chooses fonts based on the page's charset, or the specified lang? Does it honour lang attributes on DIVs, SPANs or other elements?
In contrast, Moz chooses fonts based on every single character on the page. Once I figure this out, I'll rewrite that paragraph, because the two browser's behaviour definitely can't be summed up as the same thing. Michael Z. 2005-01-21 17:32Z
It's still a bit guesswork, so some tests or a really knowledgeable source is needed. My current hypothesis: MSIE/Win can mix different fonts on page, using explicit fonts and (I suppose lang tags). And (IMHO) it looks at the actual characters, but not to find a font really including them (I'd say it never asks a font which characters it supports), but only to switch to the correct "block". A chinese character will switch it to the font configured for chinese (without looking whether that character is really included), but an m-underdot, if at all, only switches to the standard Unicode font. Sorry for the confusion, but at least, I haven't programmed it. --Pjacobi 22:52, 2005 Jan 21 (UTC)

"IE5 was the first to use glyphs from 'best available' fonts"

Mjb, I don't know what Microsoft calls it, but it doesn't pick the right fonts to display all the characters on the page, the way other modern browsers do.

You'll notice that in many places in Wikipedia editors have added code like style="font-family:Arial Unicode MS, Lucida Sans Unicode, sans-serif;" to tables displaying Unicode characters. We have had to develop Template:IPA (documentation) and Template:Polytonic to display IPA and Polytonic Greek characters in MSIE. These are all hacks, aimed only at MSIE on Windows. On a stock Mac or Windows system the necessary fonts are present, and Safari and Firefox display all these characters. But MSIE displays little squares, unless web authors guess which fonts the system might have and specify them in each and every instance where these Unicode characters appear. Michael Z. 2005-01-31 07:22 Z