Wikipedia talk:WikiProject Typography/Unicode

Initial discussion

This discussion was moved here from Wikipedia talk:WikiProject Typography#Unicode tables.

Hi folks. Some articles contain table grids of Unicode glyphs. Each grid row contains information on several Unicode code points. Each grid cell contains several sub-elements for a particular code point. See, for example, Letterlike Symbols. I find this layout extremely confusing and hard to use. I would like to change it to a more linear table, with one code point per row. Something roughly like the table in Miscellaneous Symbols. Comments? Objections? —DragonHawk (talk|hist) 18:34, 10 October 2008 (UTC)

I like the presentation used in the Miscellaneous Symbols article, but not the implementation. When the point is to show the shape of a character, we should not depend on the correct font being installed on the user's computer, since if his font is incorrect he will be misled into thinking it is correct. Instead, we should use graphics to draw the character. I haven't figured out how to embed a graphic in a table, so I made the table into a graphic in OCR-A font. John Sauter (talk) 03:46, 28 October 2008 (UTC)

Since writing the above paragraph I have figured out how to include the shape of a character in a table. Briefly, I use Inkscape to create a 1000-point character, convert it to a path, save the .svg file, upload it to Wikimedia, and reference it using the Image: prefix, with a size of 10 pixels. If the reader clicks on the graphic, he sees a very detailed representation. Here is an example: . John Sauter (talk) 12:53, 3 November 2008 (UTC)

Some of the Unicode tables do provide separately rendered images of each glyph. When available, I think such should be included, for the same reasons you identify. However, I think we should also include the "as is" Unicode character, so that browsers which can render it natively do so. That also makes it possible to use the articles as a copy-and-paste reference (like Charmap). · I'll work on a table design that incorporates everything, and post here with progress. —DragonHawk (talk|hist) 15:24, 3 November 2008 (UTC)

Okay, here's my first pass at it, using Letterlike Symbols as a pilot case. I think this new layout is much better than the old layout, both for the reader and the editor. With one codepoint per row, it's easier for the eye to correlate fields and the current character. It can also be a sortable table. For the editor, the markup is much cleaner and easier to work with. · It does tend to leave a lot of whitespace to the right of the table on wider screens. I may eventually try tackling that with CSS columns. One thing at a time. :) · I did the conversion using a purpose-written Perl script. Hopefully, I can re-use it for additional conversions, making them almost easy. · Suggestions, comments, commendations, condemnations? —DragonHawk (talk|hist) 05:55, 4 November 2008 (UTC)
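The Perl script mentioned above is not shown in this discussion. As a rough illustration only, the one-code-point-per-row layout could be generated along these lines in Python. The column set (Char, Hex, Dec, Name) and the function names `codepoint_row` and `block_table` are assumptions made for this sketch, and the Image column is left out because it depends on which image files exist.

```python
import unicodedata


def codepoint_row(cp: int) -> str:
    """Return one wikitable row for a single code point (hypothetical columns)."""
    ch = chr(cp)
    # Fall back to a placeholder for unassigned or unnamed code points.
    name = unicodedata.name(ch, "(unnamed)")
    return f"|-\n| {ch} || {cp:04X} || {cp} || {name}"


def block_table(first: int, last: int) -> str:
    """Return a sortable wikitable covering the inclusive range first..last."""
    lines = ['{| class="wikitable sortable"', "! Char !! Hex !! Dec !! Name"]
    lines.extend(codepoint_row(cp) for cp in range(first, last + 1))
    lines.append("|}")
    return "\n".join(lines)


# Example: the first two code points of the Letterlike Symbols block.
print(block_table(0x2100, 0x2101))
```

A real conversion would also need to escape characters that are meaningful in wiki markup, which this sketch does not attempt.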

I do like your new layout, but I think it would be better still if the Char and Image columns were described in more detail. Perhaps "The Char column shows how your browser renders the character; for obscure characters you might see a box containing the hexadecimal code in small type, question marks, or nothing at all. The Image column shows the character rendered using the (something) font." If you are using several different fonts to image all the characters, that should be explained.

Also, I think the Hex column should be formatted as "U+(hex number)" or "\u(hex number)" since that is how these numbers will be used. Similarly, the decimal column could be "&u(decimal number);". I think that change would also allow the Hex column to sort. See OCR-A Regular characters for an example of a sortable Hex column. John Sauter (talk) 15:38, 5 November 2008 (UTC) John Sauter (talk) 16:34, 28 November 2008 (UTC)

I just realized I never responded to this. Point by point:

I plan on coming up with some kind of boilerplate description of the table columns, to cover what you mention, and more. Probably as a template, to save effort and keep consistency. But I planned on waiting until all conversions were done. Something might come up.
I didn't render those images myself; I just used images already existing on Wikipedia. Worrying about the correctness or source of those images is outside the scope of my effort. If you want to attack that aspect, please do! :)
- It would be really nice to have SVGs of all the Unicode characters, but I'm not sure of the copyright issues around that. I think we'd need to use a source font that was GFDL or CC compatible. I expect you know more about this than I do.

I doubt I know more than you, but my opinion is that an image of a single character, or a meaningless list of characters like abcdefghijklmnopqrstuvwxyz, cannot be copyrighted since there is no "creative content" beyond the shape of the character. At least in the United States, fonts cannot be copyrighted. However, keep in mind that the concept of "image of a Unicode character" is flawed, since a character does not correspond to a particular shape. You need both a character and a font to get a shape. John Sauter (talk) 16:39, 1 December 2008 (UTC)

- If you do end up doing work on this, I would suggest a file name format of U+xxxx.svg, to be compatible with existing images of that format.

Unfortunately, different vendors have placed glyphs at different code points, at least with OCR-A. That is why I have chosen to use the character's name rather than its code point when creating the image. John Sauter (talk) 16:30, 1 December 2008 (UTC)

Syntax such as "U+", "\u", "&u", etc., is specific to the context. HTML isn't the same as Perl, etc. That's why I went with no prefix for decimal, and the WP:MOSNUM recommendation for hex. The values are universal.
However, you're right in that the 0x prefix breaks table column sorting. I didn't even realize that. I've adjusted Letterlike Symbols to just give the four-character hexadecimal value, without any prefix, and it sorts properly now. The table makes it clear these are hex, so no prefix is needed. And no-prefix is also more universal.

Make sense? Anyone else have anything they'd like to say? I'll start attacking more pages Real Soon Now, if there are no objections. —DragonHawk (talk|hist) 06:39, 1 December 2008 (UTC)
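To make the point about context-specific syntax concrete, here is a small Python sketch that writes one code point in several of the notations discussed above. The function name `notations` and the dictionary keys are invented for this illustration. Note that HTML's numeric character references are written `&#8482;` (decimal) and `&#x2122;` (hexadecimal), and that the `\u` form shown follows Python's escape syntax; other languages differ.

```python
def notations(cp: int) -> dict:
    """Return the same code point written in several context-specific forms."""
    return {
        "unicode": f"U+{cp:04X}",      # notation used by the Unicode standard
        "hex": f"{cp:04X}",            # bare hexadecimal value, no prefix
        "decimal": str(cp),            # bare decimal value, no prefix
        "html_decimal": f"&#{cp};",    # HTML decimal character reference
        "html_hex": f"&#x{cp:X};",     # HTML hexadecimal character reference
        # Python string escapes: \uXXXX up to U+FFFF, \UXXXXXXXX above that.
        "python_escape": f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}",
    }


for label, text in notations(0x2122).items():
    print(f"{label:14} {text}")
```

Only the bare hexadecimal and decimal values are independent of context, which is the argument made above for leaving prefixes out of the table.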

Continuation of discussion

(This part of the discussion took place after it was moved to a separate page by DragonHawk on 22 June 2009.)

Hex prefix

I continue to disagree with the removal of the U+ prefix from the hex column. It does not prevent sorting (though other prefixes do) and it is not terribly difficult to copy just the hex digits to another context when doing copy and paste. In favor of keeping the U+ prefix is that it is used by the Unicode standard to designate a Unicode character. John Sauter (talk) 05:07, 22 June 2009 (UTC)

Sorry, I didn't realize you (or anyone else) felt that strongly about it. Originally you had just said the numbers should be prefixed, so that seemed a lot less directed, especially when some of the suggested prefixes implied a context like HTML, or prefixes for decimal expressions. If it will sort properly, I think it's reasonable to go with what the Unicode standard uses. • I'd like to see thoughts from more people, just on general principles, but this is probably sufficiently esoteric for that to be unlikely. —DragonHawk (talk|hist) 11:59, 22 June 2009 (UTC)
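The sorting question can be illustrated independently of how Wikipedia's sortable tables actually compare cells, which this sketch does not model. In plain string comparison, a uniform prefix such as U+ does not change the order, and four-digit values sort the same way as their numeric values; the order only diverges once values of different lengths are mixed. The helper name `codepoint_key` and the sample labels are made up for the example.

```python
def codepoint_key(label: str) -> int:
    """Parse 'U+2122', '0x2122' or '2122' into its integer code point."""
    text = label.strip().upper()
    for prefix in ("U+", "0X"):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return int(text, 16)


# Four-digit labels: plain string order equals numeric order.
bmp = ["U+2122", "U+2100", "U+00A9"]
print(sorted(bmp) == sorted(bmp, key=codepoint_key))

# Adding a five-digit label breaks that agreement.
mixed = bmp + ["U+1D49C"]
print(sorted(mixed))                     # string order
print(sorted(mixed, key=codepoint_key))  # numeric order
```

If a table ever mixes four-digit and five-digit code points, padding every value to the same width would keep a text sort in numeric order.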

Font choice

In addition, there is the problem of choosing a font. Some Unicode characters are so obscure that probably nobody would know whether they had been rendered using Bitstream Vera Serif or Century Schoolbook L, but a standard for describing Unicode characters must deal in a reasonable way with all Unicode characters, not just the obscure ones. I suggest that the default font for displaying a Unicode character should be the FreeSerif font distributed with OpenOffice. It seems to match the images published in the Unicode standard reasonably well, and it contains quite a lot of characters. Of course, for the obscure characters we are lucky to find any font which contains the character; I am only suggesting that FreeSerif be the display font when it is a reasonable choice. John Sauter (talk) 05:07, 22 June 2009 (UTC)