Talk:Comparison of Unicode encodings

CJK characters use 3 bytes in UTF-8?

This article states that "...There are a few, fairly rarely used codes that UTF-8 requires three bytes whereas UTF-16 requires only two..."; but it seems to me that most CJK characters take 3 bytes in UTF-8 but 2 bytes in UTF-16? 76.126.165.196 (talk) 08:32, 25 February 2008 (UTC)
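For reference, this is easy to verify (a minimal Python sketch; the sample characters are arbitrary):

```python
# Compare encoded lengths for an ASCII letter and a common CJK ideograph.
latin = "a"        # U+0061
cjk = "\u4e2d"     # U+4E2D, a BMP ideograph

# UTF-8: 1 byte for ASCII, 3 bytes for BMP code points at or above U+0800.
print(len(latin.encode("utf-8")))      # 1
print(len(cjk.encode("utf-8")))        # 3

# UTF-16: 2 bytes for any BMP code point ("-le" avoids counting a BOM).
print(len(latin.encode("utf-16-le")))  # 2
print(len(cjk.encode("utf-16-le")))    # 2
```

This supports the questioner's reading: BMP ideographs cost 3 bytes in UTF-8 versus 2 in UTF-16, and such characters are common rather than rare.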

Requested move

This article appears to be a sub-page of Unicode, which is ok; but it should have an encyclopedic name that reflects its importance (that of an article on Unicode encodings, rather than some evaluative comparison). —donhalcon 16:26, 7 March 2006 (UTC)

It should be moved to Unicode encodings. Once that's done, the opening sentences should be redone to inform readers on the basic who/what/why. --Apantomimehorse 10:11, 10 July 2006 (UTC)

UTF-24?

Hex 110000, the grand total of 17 planes, obviously takes 21 bits, which comfortably fit into 3 bytes (24 bits). So why would anyone want to encode 21 bits in 32 bits? The fourth byte is entirely redundant. What, then, is the rationale behind having UTF-32 instead of "UTF-24"? Just a superstitious fear of odd numbers of bytes? dab () 12:47, 6 July 2006 (UTC)

It's more than superstitious fear of odd numbers of bytes: it is a fact that most computer architectures can process multiples of bytes equal to their word size more quickly. Most modern computers use either a 32-bit or 64-bit word. On the other hand, modern computers are fast enough that the speed difference is irrelevant. It is also true that most computer languages provide easy ways to refer to those multiples. (For example, in C on a 32-bit machine, you can treat UTF-32 in the machine's native byte order as an array of integers.) --LeBleu 23:01, 7 July 2006 (UTC)
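The array-of-integers point can also be illustrated from Python (a sketch; it assumes a platform where the array module's "I" items are 4 bytes wide, which holds on all common systems):

```python
import array
import sys

text = "A\u00e9\u4e2d\U0001f600"  # a mix of BMP and non-BMP code points

# Encode as UTF-32 in the machine's native byte order, without a BOM.
utf32 = text.encode("utf-32-le" if sys.byteorder == "little" else "utf-32-be")

# Reinterpret the byte buffer as 32-bit unsigned integers: one per code point.
codepoints = array.array("I", utf32)
assert codepoints.itemsize == 4  # platform assumption noted above
print([hex(cp) for cp in codepoints])  # ['0x41', '0xe9', '0x4e2d', '0x1f600']
```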
Why not ask why we don't have UTF-21, since the last three bits in UTF-24 would be entirely redundant? Same issue, basically, but on a different scale (the hypothetical UTF-21, if actually stored as 21-bit sequences, would be much slower to process without noticeable size gain). Word sizes tend to be powers of two, so if data can be presented as (half)-word sized at little extra cost, this will be done unless there are overriding reasons of space economy. And if you want space economy, you should use UTF-16 anyway, since the extra processing power you must pay for characters outside the BMP is (usually) not significant enough to warrant using twice as much storage.
Nothing actually prohibits you from layering another encoding over UTF-32 that stores the values in three bytes, as long as you supply the redundant byte to anything that advertises itself as processing UTF-32. This is unlikely to be of much advantage, though. 194.151.6.67 11:36, 10 July 2006 (UTC)
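Such a layered three-byte store is only a few lines of code (a hypothetical sketch; "UTF-24" is a made-up name here, not a sanctioned encoding):

```python
def utf24_encode(text: str) -> bytes:
    """Pack each code point into three little-endian bytes (21 bits fit easily)."""
    return b"".join(ord(ch).to_bytes(3, "little") for ch in text)

def utf24_decode(data: bytes) -> str:
    """Widen each 3-byte group back to a code point; the fourth byte is implicit."""
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "little"))
        for i in range(0, len(data), 3)
    )

text = "A\u4e2d\U0001f600"
packed = utf24_encode(text)
assert len(packed) == 3 * len(text)  # exactly 3 bytes per code point
assert utf24_decode(packed) == text  # round-trips losslessly
```

As noted above, the widened form would still have to be supplied to anything that expects real UTF-32, so the saving exists only at rest.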
So the fourth byte is really redundant, and hangs around in memory for faster processing speed. I imagine that all UTF-32 files will have to be compressed as soon as they are stored anywhere; the question then is, which is more of a waste of processing power: compressing and uncompressing the files, or adding a zero byte at read-time before further processing? UTF-8 is only economical if the overwhelming majority of characters are in the low planes. Assume (for argument's sake) a text with characters evenly distributed in the 17 planes: UTF-8 would be out of the window, but "UTF-24" might have an advantage over UTF-32 (obviously "UTF-21" would be even more economical, but that would really mean a lot of bit-shifting). dab () 18:14, 29 July 2006 (UTC)
To answer your direct question: adding an extra byte after every 3 will be far, far less processing than implementing something like deflate. Having said that, I can't see many situations where you would do it.
Text with characters evenly distributed among the planes is going to be very, very rare. Only 4 planes have ever had any allocations at all (BMP, SMP, SIP and SSP), only two of those contain character ranges for complete scripts (the other two are rare CJK ideographs and special control codes), and most texts will be highly concentrated on a few small ranges.
If you are concerned with storage space and you are dealing with a lot of non-BMP characters in your text (say, an archive of Tolkien's Tengwar and Cirth manuscripts) then you will have to choose between possibilities such as a custom encoding, compression encodings like SCSU and BOCU, and general-purpose compression algorithms like deflate. With most systems, however, even if individual documents are non-BMP, the overwhelming majority of characters in the system as a whole are in the BMP.
A final point: if heavy use is made of HTML, XML, or similar markup languages for formatting, the ASCII characters of the markup can easily far outnumber the characters of the actual document text (see the sketch below). Plugwash 23:06, 29 July 2006 (UTC)
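That last point is easy to quantify (a Python sketch with a made-up XHTML fragment; the tag names and attributes are arbitrary):

```python
# A four-character CJK sentence wrapped in typical markup (hypothetical fragment).
doc = '<p class="body"><span lang="zh">\u4e2d\u6587\u6587\u672c</span></p>'

ascii_count = sum(1 for ch in doc if ord(ch) < 128)
cjk_count = len(doc) - ascii_count
print(ascii_count, cjk_count)  # 43 ASCII markup characters vs 4 CJK characters

# Consequence for size: UTF-8 wins overall, even though each CJK character
# costs 3 bytes there and only 2 in UTF-16.
print(len(doc.encode("utf-8")))      # 55 bytes
print(len(doc.encode("utf-16-le")))  # 94 bytes
```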

UTF-7,5?

See this page [1], which describes the encoding. Olivier Mengué | 23:19, 22 May 2007 (UTC)

Question?

So what is the most popular encoding??? —Preceding unsigned comment added by 212.154.193.78 (talk) 07:52, 15 February 2008 (UTC)

UTF-8 is popular for Latin-based text, while UTF-16 is popular for Asian text. And everyone hates UTF-32 ;-) 88.68.223.62 (talk) 18:32, 27 March 2008 (UTC)
Not really. While UTF-8 is more compact than UTF-16 for most alphabetic scripts, and UTF-16 is smaller than UTF-8 for CJK scripts, the decision is often based on considerations other than size (legacy encodings are also commonly used, but we will focus on Unicode encodings here).
In the Unix and web worlds UTF-8 dominates because it is possible to use it with existing ASCII-based software with little to no modification. In the Windows NT, .NET, and Java worlds UTF-16 is used because, when those APIs were designed, Unicode was 16-bit fixed width and UTF-16 was the easiest way to retrofit Unicode support. There are one or two things that use UTF-32 (I think Python uses it under certain compile options, and some C compilers make wchar_t 32-bit), but mostly it is regarded as a very wasteful encoding (and the advantage of being fixed width turns out to be mostly an illusion once you implement support for combining characters). Plugwash (talk) 21:43, 4 April 2008 (UTC)
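The "fixed width is an illusion" point is easy to demonstrate (a Python sketch; the sample character is arbitrary):

```python
import unicodedata

# One user-perceived character built from a base letter plus a combining accent:
# two code points, and therefore two 4-byte units even in "fixed-width" UTF-32.
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
print(len(decomposed))                           # 2 code points
print(len(decomposed.encode("utf-32-le")) // 4)  # 2 UTF-32 units

# The precomposed form is a single code point yet renders identically, so code
# that counts or slices fixed-width units still isn't counting "characters".
composed = unicodedata.normalize("NFC", decomposed)  # "\u00e9"
print(len(composed))  # 1
```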

Mac OS Reference

This seems to be a bit out of date. I just searched the reference library and cannot come up with anything in the current version of Mac OS regarding UTF-16. Since the cited material is two revisions old (10.3 vs. the current 10.5), and since Mac OS understands UTF-8, the fact that it uses UTF-16 in a previous version for INTERNAL system files is irrelevant. I suggest this be removed. Lloydsargent (talk) 14:08, 24 April 2008 (UTC)