
Re: Unicode vs The Rest Of The World (Again) (was Re: Re: Le tilde a-t-il été utilisé en français?)

From: Garth Wallace <gwalla@...>
Date: Saturday, May 1, 2004, 4:42
Paul Bennett wrote:

> On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace <gwalla@...>
> wrote:
>
>> Paul Bennett wrote:
>>
>>> It's only a small subset of Unicode that gets mangled, rather than every
>>> character (we've seen it on the Georgian alphabet, notably), at least
>>> with
>>> UTF-8. UTF-8 is not merely raw Unicode, but rather a set of multi-byte
>>> codes, only some of which lie within the deadly 128-150 range.
>>
>>
>> Ah, so it's only the Unicode characters that contain bytes matching
>> ASCII control characters with the 8th bit set that get mangled. Okay.
>
>
> Not Unicode characters. UTF-8 strings, which are not the same thing. A
> UTF-8 string can be one or more bytes long (bytes which are kinda supposed
> to be "safe" bytes to pass), and resolves mathematically to a single
> Unicode character. See my example below.
I was using "character" to mean a sequence of bytes corresponding to an abstract Unicode character. I do understand how Unicode works.
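
To make the point concrete, here is a rough Python sketch (not anything from the list software itself) that encodes a few characters as UTF-8 and reports which of the resulting bytes fall in 0x80-0x9F, i.e. the ASCII control codes with the 8th bit set. Taking that as the range meant by "the deadly 128-150 range" is an assumption on my part, and the helper name c1_bytes is made up for illustration.

```python
# Assumption: the "deadly" bytes are 0x80-0x9F (128-159), the ASCII
# control codes 0x00-0x1F with the 8th bit set.
C1_LOW, C1_HIGH = 0x80, 0x9F


def c1_bytes(text):
    """Return the UTF-8 bytes of `text` that fall in 0x80-0x9F."""
    return [b for b in text.encode("utf-8") if C1_LOW <= b <= C1_HIGH]


# LATIN SMALL LETTER A, LATIN SMALL LETTER E WITH ACUTE,
# GEORGIAN LETTER AN
for ch in ("a", "\u00e9", "\u10d0"):
    encoded = ch.encode("utf-8")
    risky = [hex(b) for b in c1_bytes(ch)]
    print("U+%04X" % ord(ch), encoded.hex(), risky)
```

Under that assumption, "a" (one byte, 61) and "é" (two bytes, c3 a9) contain no bytes in the range, while Georgian "ა" (three bytes, e1 83 90) contains two, which fits the observation that Georgian text gets mangled while much other text survives.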
>>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>>> itself more often, especially if they use the right (or wrong?) Unicode
>>> pages.
>>
>>
>> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.
>
> Nulls that any sensible software[*] would simply either skip or print as a
> non-spacing space.
*Really* sensible software wouldn't be treating UTF-16 as ASCII. :P But yeah.
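
For the curious, a small Python sketch of the nulls in question. It assumes little-endian UTF-16 with no byte order mark and text that stays within the ASCII range; the function name read_utf16_as_ascii is hypothetical, and skipping nulls is only one of the two behaviours mentioned above.

```python
def read_utf16_as_ascii(text):
    """Encode `text` as UTF-16 (little-endian, no BOM), then read the
    bytes back as ASCII while skipping nulls, the way a lenient
    ASCII-only program might. Only meaningful for ASCII-range text."""
    raw = text.encode("utf-16-le")
    null_count = raw.count(0)                 # number of 0x00 bytes
    visible = bytes(b for b in raw if b != 0)
    return raw, null_count, visible.decode("ascii")


raw, null_count, visible = read_utf16_as_ascii("Hi!")
print(raw)         # every ASCII letter is followed by a null byte
print(null_count)
print(visible)
```

For "Hi!" the encoded form is six bytes, three of them null, and dropping the nulls gives back "Hi!". That trick stops working as soon as the text leaves the ASCII range, since the high byte of each 16-bit unit is then no longer zero.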