Non-ASCII (UTF8) characters
Added by Vincenzo Romano almost 14 years ago
I'm encountering strange behavior, possibly due to an error on my part.
If I put UTF-8 characters (UTF-8 is my system-wide locale) into a WText (via the constructor or ->setText()) from a constant string, the non-ASCII characters come out mangled in the web output. More precisely, I get a pair of question marks in place of each non-ASCII character.
Those same characters are displayed properly, however, when used in ->setHeaderData().
Using the HTML character entity representation doesn't help.
I've read that Wt is UTF-8-aware, so this puzzles me. Please consider the following code samples.
data->setHeaderData( 5,any( (const char*) "Tschüß" ) ); // This works fine
...
head->addWidget( new WText( "Tschüß" ) ); // This one doesn't
What am I missing?
P.S.
The above word "Tschüß" should be displayed ending with a "u-umlaut" and an "Eszett", not with weird characters or four question marks.
Replies (4)
RE: Non-ASCII (UTF8) characters - Added by Wim Dumon almost 14 years ago
First of all, it's better not to write UTF-8 string literals in C code, because that is not portable. I've posted this on the mailing list before; I'm sure that searching for 'unicode' or 'utf-8' in the archives will show some results of interest.
Just because your system locale is UTF-8 encoded doesn't mean that it is also your default locale in C. By default it is the 'C' locale. To set your global C locale to your system locale, write this:
std::locale::global(std::locale(""));
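A minimal sketch of that call in context, using only the standard library (no Wt). The helper names globalLocaleName and adoptEnvironmentLocale are made up for this illustration. The try/catch is there because std::locale("") throws std::runtime_error when the environment names a locale that is not installed:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: name of the current global C++ locale.
// A default-constructed std::locale is a copy of the global one.
std::string globalLocaleName() {
    return std::locale().name();
}

// Hypothetical helper: switch the global locale to the one selected by the
// environment (LANG, LC_ALL, ...). Returns false and leaves the global
// locale unchanged if that locale is not available on this machine.
bool adoptEnvironmentLocale() {
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling globalLocaleName() at the very start of main() should report "C"; calling it again after a successful adoptEnvironmentLocale() reports whatever the environment selects, which depends on the machine.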
And lastly, the recommended way is to specify the locale of your strings explicitly. For UTF-8, use WString::fromUTF8():
data->setHeaderData( 5,any( (const char*) "Tschüß" ) ); // This works fine
...
head->addWidget( new WText( WString::fromUTF8("Tschüß") ) ); // Now this one works too
When a boost.any char* is encountered, Wt assumes that it is UTF-8 encoded and will call WString::fromUTF8() automatically.
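To illustrate what "interpreting the bytes as UTF-8" means independently of any locale, here is a small standalone decoder. It is only a sketch for this thread, not Wt's implementation of WString::fromUTF8(); the function name decodeUtf8 is made up, and it does only minimal validation. The literal uses hex escapes so the source file itself stays pure ASCII:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points
// without consulting any locale. Throws std::invalid_argument on a bad lead
// byte, a bad continuation byte, or truncated input. Overlong encodings and
// surrogates are not rejected (minimal validation only).
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, decodeUtf8("Tsch\xC3\xBC\xC3\x9F") takes the 8 bytes of "Tschüß" and yields 6 code points, the last two being U+00FC and U+00DF. Each of the two non-ASCII characters occupies two bytes, which would be consistent with seeing two question marks per character when the bytes are converted one by one under the 'C' locale.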
It's not unthinkable that we will modify the default WString locale to be UTF-8, rather than the current behaviour (the global C locale) - hopefully all other character encodings will die in the long run.
BR,
Wim.
RE: Non-ASCII (UTF8) characters - Added by Vincenzo Romano almost 14 years ago
Thanks,
std::locale::global(std::locale(""));
did the "magic".
In my humble opinion, I would use the system locale by default, whatever it is.
I would not force a specific locale, even though UTF-8 by default makes sense, because it could differ from the system one.
RE: Non-ASCII (UTF8) characters - Added by Wim Dumon almost 14 years ago
I don't know the standard by heart, but I assume that setting the 'default' locale is a compiler implementation detail. On Linux/GCC it's the C locale, but it wouldn't surprise me if it were something else on Windows/MSVS. In any case, I think you'll agree that Wt, as a library, should not modify the global C locale for you.
Wt allows, and should always allow, you to specify the locale when you convert a char * to a WString, and we encourage you to do so by invoking WString::fromUTF8() explicitly (or by specifying a std::locale as a parameter to the WString constructor). I was only referring to changing the default behaviour, which we'd only do in a major version change and with a proper amount of warnings, since this would effectively break the current API.
Bottom line: I'm happy to hear your problem is solved!
Wim.
RE: Non-ASCII (UTF8) characters - Added by Vincenzo Romano almost 14 years ago
While I agree that Wt, as a library, should not modify the global C locale, I would say that it could simply make sure that the one in use is the one defined by the underlying system and not by the compiler.
Actually, it all comes down to the definition of "global locale": global to the system, or global to the compiler's run-time support.
Anyway, I think you are right to leave things as they are: it's up to the programmer to build her own defined environment (and locale).
Bottom line: Wt rocks!