[Clfs-support] Fw: [elug] utf-8 vs iso8859

Tue Jan 27 05:22:47 PST 2009

On Mon, Jan 26, 2009 at 08:34:45PM -0700, Randolph D Dach wrote:
> 
> I'm trying to set up the locale for  this computer and was wondering 
> 
> should 
> LC_ALL=en_CA , en_CA.utf8, or en_CA.iso88591
> 
> I've looked at the net for arguments for or against the above options but none of it makes much sense.  Does anyone out there know where I can find definitive information on the advantages/disadvantages to setting the locale to any of the above locales
> 
> Tks
> -- 
> Randy Dach
> Having fun as usual

 I've been using UTF-8 for a few years now.

 UTF-8 supports pretty much every language (although with current
software many African languages are poorly supported in practice,
because combining accents don't usually work).  In an xterm or any
graphical application, you should be able to display glyphs from all
languages together.  On the console, support is more restricted: a
screen font holds at most 512 different glyphs, so Japanese and
Chinese cannot be shown on a regular console.

 The legacy encodings support far fewer characters, and the various
Latin/8859 encodings disagree about which character a particular
byte value represents, so it is not possible to mix in one file
e.g. Hungarian (double acute accents on o and u : ő ű) or Polish
(slash on l : ł, tail (ogonek) on e.g. e : ę ) with western European
variations (such as the tilde on e.g. n : ñ, or the Scandinavian
letters such as ae : æ ).
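
 To make the clash concrete, here is a minimal sketch in Python 3
(my choice of language; the helper name decode_both is made up for
this example).  It decodes the same byte under ISO-8859-1 and
ISO-8859-2, then shows that one UTF-8 string holds all of the letters
mentioned above.

```python
# The same byte value means a different letter in each legacy encoding.
SAMPLE_BYTES = [0xB3, 0xEA, 0xF1, 0xF5, 0xFB]


def decode_both(byte_value):
    """Decode one byte as ISO-8859-1 (western) and ISO-8859-2 (central)."""
    raw = bytes([byte_value])
    return raw.decode("iso-8859-1"), raw.decode("iso-8859-2")


for value in SAMPLE_BYTES:
    western, central = decode_both(value)
    print(f"0x{value:02X}: ISO-8859-1 {western!r}   ISO-8859-2 {central!r}")

# UTF-8 has no such conflict: all of these round-trip in one string.
mixed = "ő ű ł ę ñ æ"
assert mixed.encode("utf-8").decode("utf-8") == mixed

# ISO-8859-1 simply has no slot for the Hungarian or Polish letters.
try:
    mixed.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("ISO-8859-1 cannot encode:", err.object[err.start])
```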

 If you use GNOME, it has preferred UTF-8 for several years.

 The major disadvantage of UTF-8 is that some people will persist in
using legacy encodings.  For me, that is not a significant problem -
I get mail from Windows users anyway with strange \244 or whatever
characters (probably 'smart quotes' in some Windows-specific
codepage).

 For text, UTF-8 files are of course a little larger (each non-ASCII
character takes two or more bytes), but in an age when most documents
use XML the overhead of UTF-8 is not usually significant.
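
 A rough way to see the size difference, sketched in Python 3 (the
function name encoded_sizes and the sample strings are made up for
illustration): ASCII text, including markup, is the same size in both
encodings, and only the accented letters cost extra bytes.

```python
def encoded_sizes(text):
    """Return (ISO-8859-1 size, UTF-8 size) in bytes for the given text."""
    return len(text.encode("iso-8859-1")), len(text.encode("utf-8"))


plain = "<p>plain ASCII markup costs nothing extra</p>"
accented = "<p>mañana, Tragödie, æ</p>"

for sample in (plain, accented):
    legacy, utf8 = encoded_sizes(sample)
    print(f"{legacy} bytes as ISO-8859-1, {utf8} bytes as UTF-8: {sample}")
```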

 The other problem with UTF-8 is that if you have a glyph which you
can't render for lack of a suitable font, it takes a lot longer to
decode the multibyte character by hand to work out what its value
is in the conventional U+nnnn format.
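
 The by-hand decoding is just bit-shuffling, so it can be sketched in
a few lines of Python 3.  The function name utf8_to_codepoint is made
up for this example, and it is only a sketch: it handles one
character and does not reject overlong forms or surrogates the way a
strict decoder would.

```python
def utf8_to_codepoint(raw):
    """Decode the single UTF-8 sequence in raw into U+nnnn notation."""
    lead = raw[0]
    if lead < 0x80:              # 0xxxxxxx: one byte, plain ASCII
        length, value = 1, lead
    elif lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        length, value = 4, lead & 0x07
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    if len(raw) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in raw[1:]:
        if byte >> 6 != 0b10:    # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)  # append six payload bits
    return f"U+{value:04X}"


# The two bytes 0xC4 0xB8 are the letter kra.
print(utf8_to_codepoint(b"\xc4\xb8"))  # U+0138
```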

ĸen (in this case, the first letter is the obsolete Greenlandic kra,
which for all intents and purposes is similar to Cyrillic or Greek
lowercase k).
-- 
the first time as tragedy, the second time as farce