Bug #3365

closed

WPdfRenderer and UTF-8

Added by Michael Shestero almost 10 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 06/21/2014
Due date:
% Done: 0%
Estimated time:
Description

My task is to create a PDF with tables, and I am trying to use WPdfRenderer.

I cannot make WPdfRenderer render UTF-8 characters.

I use the libharu-RELEASE_2_3_0RC2 sources patched with truetype_utf8.patch from here:

https://groups.google.com/forum/#!msg/libharu/YzXoH_K3OAI/hxinqnq5Um0J

libharu itself does output UTF-8 characters, but WPdfRenderer doesn't!

Please help!

See my sample program below:

#include <locale.h>
//#include <boost/bind.hpp>

#include <iostream>
#include <locale>   // std::locale, used in main()

#include <Wt/Render/WPdfRenderer>
#include <hpdf.h>

using namespace Wt;

extern "C" {
  HPDF_STATUS HPDF_UseUTFEncodings(HPDF_Doc pdf);
}

void error_handler(HPDF_STATUS  error_no,
                   HPDF_STATUS  detail_no,
                   void        *user_data)
{
  std::cerr << "libharu error: error_no="
            << (unsigned int) error_no
            << ", detail_no="
            << (int) detail_no
            << std::endl;
}

void testTwo()
{
  HPDF_Doc pdf = HPDF_New(error_handler, 0);

  HPDF_SetCompressionMode(pdf, HPDF_COMP_ALL);

  // Note: UTF-8 encoding (for TrueType fonts) is only available since libharu 2.3.0 !
  HPDF_UseUTFEncodings(pdf); // will register encoder "UTF-8"
  HPDF_SetCurrentEncoder(pdf, "UTF-8");  // hpdf_encoder_utf.c

  const char* font_name =
          // HPDF_LoadTTFontFromFile(pdf, "/usr/local/fonts/arial.ttf", HPDF_TRUE);
          // HPDF_LoadTTFontFromFile(pdf, "/usr/share/fonts/truetype/freefont/FreeSans.ttf", HPDF_TRUE);
          HPDF_LoadTTFontFromFile(pdf, "/usr/share/fonts/truetype/freefont/FreeMono.ttf", HPDF_TRUE);

  std::cout << "Font name: " << font_name << std::endl;
  HPDF_Font font = HPDF_GetFont(pdf, font_name, "UTF-8");


  HPDF_Page page = HPDF_AddPage(pdf);
  HPDF_Page_SetFontAndSize(page, font, 15.0);

  HPDF_Page_SetSize(page, HPDF_PAGE_SIZE_A4, HPDF_PAGE_LANDSCAPE);


  HPDF_Page_BeginText (page);
  HPDF_Page_TextRect (page, 80.0,80.0,530,430,
                            "üöäÄÜÖß AND русский", //  ok!
                            HPDF_TALIGN_LEFT, NULL);
  HPDF_Page_EndText (page);

  Render::WPdfRenderer renderer(pdf, page);

  // Only needed if Wt is not linked against libpango
  renderer.addFontCollection("/usr/share/fonts/truetype");

  renderer.setMargin(2.54);
  renderer.setDpi(96);
  renderer.setFontScale(1);

  renderer.render(
            "<p>Copiright mark: &copy;</p><p>Russian text here: <b>проверка</b></p>",
          0 );
  // the copyright mark doesn't render properly either  :-(

  HPDF_SaveToFile(pdf, "xhtml2pdf.pdf" );
  HPDF_Free(pdf);
}

int main(int argc, char **argv)
{
  // Installed locales: locale -a
  //setlocale (LC_ALL,"ru_RU.utf8");
  //std::locale::global( std::locale("ru_RU.utf8") );
  std::cout << "C   Locale is: " << setlocale(LC_ALL,NULL) << std::endl; // ok
  std::cout << "C++ Locale is: " << std::locale().name() << std::endl; // ok

  testTwo();
}

Files

xhtml2pdf.pdf (196 KB), Koen Deforche, 06/23/2014 11:56 AM
xhtml2pdf.pdf (196 KB), Koen Deforche, 06/23/2014 04:07 PM
u.pdf (200 KB), Michael Shestero, 07/07/2014 02:32 PM
u.html (16.3 KB), Michael Shestero, 10/06/2014 07:46 PM
Actions #1

Updated by Michael Shestero almost 10 years ago

PS: I also see warnings such as "WString: narrow(): loss of detail: ???".

Sometimes I see another warning saying that some string cannot be widened.

Actions #2

Updated by Michael Shestero almost 10 years ago

I tried to rebuild Wt with the WT_NO_STD_WSTRING option.

And got the following:

 Building CXX object src/CMakeFiles/wt.dir/Wt/WStringUtil.o
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C: In function ‘std::string Wt::fromUTF8(const string&, const std::locale&)’:
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C:285:27: error: too few arguments to function ‘std::string Wt::fromUTF8(const string&, const std::locale&)’
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C:283:13: note: declared here
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C:285:33: error: ‘narrow’ was not declared in this scope
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C: In function ‘std::string Wt::toUTF8(const string&, const std::locale&)’:
/home/shestero/Downloads/wt-3.3.3/src/Wt/WStringUtil.C:313:29: error: ‘widen’ was not declared in this scope
make[2]: *** [src/CMakeFiles/wt.dir/Wt/WStringUtil.o] Error 1
make[1]: *** [src/CMakeFiles/wt.dir/all] Error 2
make: *** [all] Error 2
Actions #3

Updated by Koen Deforche almost 10 years ago

Hey,

Specifying Unicode text inside a source file isn't something that just works. The C standard does not specify how such text is interpreted. There is a tendency towards standardizing on UTF-8 (in practice), but this is nowhere stated and is not portable.

Your test can easily be fixed to run correctly on my OS (Ubuntu), but the only reliable way is to avoid Unicode literals inside the .C file and to load the text from an external file with a known encoding.

renderer.render(WString::fromUTF8(
                    "<p>Copiright mark: &copy;</p><p>Russian text here: <b>проверка</b></p>"),
          0);
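
A minimal sketch of the "external file with a known encoding" approach, assuming the XHTML is stored on disk as UTF-8. The helper name readUtf8File is hypothetical and not part of Wt or libharu; it only reads the raw bytes without any locale-dependent conversion.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Wt): returns the file contents as raw
// bytes. Binary mode avoids newline translation, so the UTF-8 byte
// sequence reaches the caller unchanged.
std::string readUtf8File(const std::string& path)
{
  std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
  if (!in)
    throw std::runtime_error("cannot open " + path);

  // Copy every byte from the stream into the string.
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

The returned bytes could then be handed to WString::fromUTF8() in the same way as the literal in the snippet above.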

Regards,

koen

Actions #4

Updated by Michael Shestero almost 10 years ago

Thank you.

In general I accept your point about UTF-8 strings in C sources.

But, sorry:

1) This doesn't explain why © is not shown in my PDF. That case does not involve UTF-8 in the source. You may suppose that this is a font problem, but I don't think so (it is a common symbol in fonts nowadays!). Anyway, I cannot solve it, so I ask for help. (I have already tried different fonts and always get the warning "WString: narrow(): loss of detail: ?", even for ©.)

2) Why can't I compile Wt with the WT_NO_STD_WSTRING option?

3) I always use UTF-8 in C sources and have had no problems before; even libharu outputs my text into the PDF correctly. The problems are only with Wt!

Actions #5

Updated by Wim Dumon almost 10 years ago

Hello Michael,

Just because it works on your system doesn't mean it's portable. Whether UTF-8 in your source code works depends on your operating system and your compiler. We've seen weird behavior with UTF-8 in source code, ranging from unexpected results to compile errors. Most Linux distributions nowadays allow it, since they fully default to UTF-8. The same is not true for Windows.

With respect to Unicode in Wt: Wt fully supports Unicode in a clearly defined way, through the class WString. WString is fully Unicode aware; internally, data is stored as UTF-8. WString is used in Wt everywhere a Unicode string can be rendered in the browser. When you put a char* in a WString, you can (and should!) choose the character set of the input string through parameters in the constructor of WString. If you omit the parameters, you get Wt's default behaviour, which means that the char* is interpreted as being encoded in the character set of the current C locale. The default C locale on Linux is not UTF-8; it is ASCII (the C locale).

Since UTF-8 is becoming the standard for string encoding (but likely the default C encoding won't change), we have added an option in Wt so that you can change WString's behavior: it will by default interpret char* as being encoded as UTF-8, rather than the character set of the current C locale. This function is void Wt::WString::setDefaultEncoding(Wt::CharEncoding encoding). It makes sense on Linux to call Wt::WString::setDefaultEncoding(Wt::UTF8). But if you ignore the existence of character encodings, you will still encounter question marks or plain wrong results at regular times (e.g. when talking to some database, after reading a file, ...).

Best regards,

Wim.

Actions #6

Updated by Koen Deforche almost 10 years ago

Hey,

As you can see the output of the test case works correctly for me so there must be something specific to your environment.

My guess is that your Wt installation isn't picking up the TTF font, since ::narrow() indicates that it cannot use UTF-8, which is only the case when it needs to fall back to the default PostScript fonts.

There are two ways in which TTF fonts are picked up by Wt:

1) Using pango, if Wt was built with pango support. You can see if this is being done by checking the messages output while building Wt.

2) By using addFontCollection() to provide Wt directly with a font collection. In this case font-selection isn't particularly clever about picking a font with the appropriate glyphs.

Since you seem to be doing the second thing already, I can only assume that it fails because pango somehow isn't configured properly. Do you see any error messages that provide a useful hint?

Regards,

koen

Actions #7

Updated by Michael Shestero almost 10 years ago

Thank you. I appreciate your support.

I am still lost.

But I've got a hint! Looking more closely at my PDF, I noticed that HPDF_Page_SetFontAndSize doesn't set the font for WPdfRenderer!

Yet it does affect HPDF_Page_TextRect! WPdfRenderer uses some other, non-monospaced font.

How do you explain that? How can I set the font for WPdfRenderer?

Yes, Pango is now disabled in my Wt. I turned it off while trying to solve this problem.

PS: Note that there is no UTF-8 in my source now. There are only the letters & c o p y ; !

PPS: Could anybody tell me something about WT_NO_STD_WSTRING?

Actions #8

Updated by Michael Shestero almost 10 years ago

PPPS: The PDF you posted also contains two different fonts!

Actions #9

Updated by Koen Deforche almost 10 years ago

I use pango, so yes, there may be two fonts. Pango selects fonts based on complex logic (looking at requirements for individual glyphs) so I'm not surprised it uses a different font for the copyright symbol versus the rest.

With Wt, you do not need to use any haru API to deal with fonts. WPdfImage/WTextRenderer does that by itself (using the system described in the post above), i.e. based on CSS in the HTML.

WT_NO_STD_WSTRING is for embedded systems without wide string support. The good news is that it's relatively unused, meaning that even poor embedded systems provide more localization/i18n things out of the box. But indeed we could fix that.

Back to your problem: are there no errors from libharu saying that it tried to load a font it doesn't know how to handle?

Also, the copyright symbol is perhaps not shown because, without pango, Wt's font selection isn't smart enough to select a font that has this symbol (only very few fonts do, I guess).

Regards,

koen

Actions #10

Updated by Michael Shestero almost 10 years ago

I solved it.

HTML:

<p style=\"font-family:FreeMono;\">Copiright mark: &copy;<br/>1234567890  русский</p>

goes to the PDF as needed.

The only "out of rules" thing I have to do is to create a soft link freemono.ttf pointing to FreeMono.ttf in my /usr/share/fonts/truetype/freefont/ directory.

Your library, or maybe the XHTML parser, lowercases the file name before it is passed to HPDF_LoadTTFontFromFile!

That's a point the developers should take note of, I guess.

Is there a more appropriate way to handle this?

Also, I cannot use this font before WPdfRenderer (as you can see in my source above). If I do, I get the libHaru error 0x1019, "Tried to load a font that has been registered".

Thank you, my task is solved. Now I'm going to reproduce this example with MinGW under Windows.

Actions #11

Updated by Koen Deforche over 9 years ago

  • Status changed from Resolved to InProgress
  • Assignee set to Koen Deforche

Hey,

Thanks for the info. There is indeed a problem with case handling in the non-pango code path.

As to the error 0x1019, this has been fixed in haru in more recent git versions.

Regards

koen

Actions #12

Updated by Koen Deforche over 9 years ago

  • Status changed from InProgress to Resolved
  • Target version changed from 3.3.3 to 3.3.4
Actions #13

Updated by Michael Shestero over 9 years ago

I have now built libHaru and Wt under Windows.

I found the following bug:

Plain Cyrillic text comes out fine. I also have no problem with text inside "p" tags and even inside tables.

But inside "b" tags (bold text) and inside "h3" it turns into question marks! Changing the font-family style for those tags did not cure this.

I also found that if I use libharu-RELEASE_2_3_0RC2 patched as described above and my table spreads over several pages in the PDF, then from the second page on the Cyrillic letters turn into question marks! But if I use the libharu-libharu-ec89be4 sources, all the pages are correct!

Actions #14

Updated by Michael Shestero over 9 years ago

Update: if I set style="font-weight: bold;" on some tag like tr or td, the Cyrillic letters turn into question marks!

Actions #15

Updated by Koen Deforche over 9 years ago

Without libpango, font selection (translating CSS to a TTF file) uses rather simplistic rules (i.e. taking the font name and appending 'bold' etc. to convert it to a filename). If you want decent font selection, you need pango.

If Wt cannot match a TTF font, it reverts to the default PostScript fonts, which do not support Unicode. That is why you get the question marks: they are produced by narrowing the text.
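
The effect of narrowing can be sketched without Wt. The code below is an illustration only, not Wt's implementation: it assumes a plain ASCII target (as with the default C locale), and the names decodeUtf8 and narrowToAscii are hypothetical. Every code point outside ASCII, including © and all Cyrillic letters, collapses to a question mark.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 bytes into Unicode code points. Minimal sketch: a malformed
// or truncated sequence yields U+FFFD and decoding resumes at the next byte.
std::vector<char32_t> decodeUtf8(const std::string& s)
{
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    char32_t cp = 0;
    std::size_t extra = 0;   // number of continuation bytes expected
    if (b < 0x80)              { cp = b;        extra = 0; }
    else if ((b >> 5) == 0x06) { cp = b & 0x1F; extra = 1; }
    else if ((b >> 4) == 0x0E) { cp = b & 0x0F; extra = 2; }
    else if ((b >> 3) == 0x1E) { cp = b & 0x07; extra = 3; }
    else { out.push_back(0xFFFD); ++i; continue; }

    bool ok = (i + extra < s.size());
    for (std::size_t k = 1; ok && k <= extra; ++k) {
      unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) ok = false;
      else cp = (cp << 6) | (c & 0x3F);
    }
    if (ok) { out.push_back(cp); i += extra + 1; }
    else    { out.push_back(0xFFFD); ++i; }
  }
  return out;
}

// Narrow to ASCII: anything that does not fit becomes '?'.
// If 'lost' is non-null it receives the number of replaced characters.
std::string narrowToAscii(const std::string& utf8, std::size_t* lost = nullptr)
{
  std::string out;
  std::size_t n = 0;
  for (char32_t cp : decodeUtf8(utf8)) {
    if (cp < 0x80) out.push_back(static_cast<char>(cp));
    else { out.push_back('?'); ++n; }
  }
  if (lost) *lost = n;
  return out;
}
```

Under these assumptions, the 8-letter word "проверка" (16 bytes in UTF-8) narrows to 8 question marks, which matches the symptom described in this thread.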

Actions #16

Updated by Michael Shestero over 9 years ago

I think CSS, fonts and libPango aren't the reason.

The font in the "b" tag is the same, and I can actually change the font using the style attribute in the p tag (keeping the Cyrillic letters).

It looks like a recursive call deep in the renderer is narrowing the UTF-8 characters.

Actions #17

Updated by Koen Deforche over 9 years ago

The only reason why Wt will narrow the text is when it cannot use a TTF font. The font file that contains the 'bold' font thus has a file name that Wt's (simple) logic does not expect.
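
To make the file-name problem concrete, here is a rough sketch of a name-based lookup of the kind described in this thread (lowercasing the family name and appending 'bold' etc.). It is a hypothetical illustration, not Wt's actual code; the function name guessFontFileName and the exact suffixes are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical illustration only: builds a candidate TTF file name from a
// CSS font family and style flags using purely textual rules.
std::string guessFontFileName(const std::string& family, bool bold, bool italic)
{
  std::string name = family;

  // Lowercase the family name ("FreeMono" -> "freemono").
  std::transform(name.begin(), name.end(), name.begin(),
                 [](unsigned char c) {
                   return static_cast<char>(std::tolower(c));
                 });

  // Append style suffixes (assumed spelling).
  if (bold)   name += "bold";
  if (italic) name += "italic";

  return name + ".ttf";
}
```

With rules like these, "FreeMono" maps to freemono.ttf, which does not exist on a case-sensitive file system where the file is called FreeMono.ttf, and bold "Arial" maps to arialbold.ttf, whereas Windows ships that face as arialbd.ttf. In both cases a lookup of this kind would fail and the renderer would fall back to the default fonts.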

Actions #18

Updated by Michael Shestero over 9 years ago

Yes, I was wrong. The nested tags work fine.

Still, it looks like a bug. It's painful if I need libPango only to draw bold text. Please find my test in the attachment.

<p>Тест - проверка 1 ok</p>
<p style="font-family: Arial;">проверка 2 ok</p>
<p style="font-weight: normal;">проверка 3 ok</p>
<p style="font-weight: normal; font-family: Arial;">проверка 4 ok</p>
<p style="font-weight: bold; font-family: Arial;">проверка 5 FAIL
    <p style="font-weight: normal; font-family: Arial;">проверка 6 ok (nested)</p>
</p>
Actions #19

Updated by Michael Shestero over 9 years ago

My problem is still not solved. :-(

I tried to build Wt with libPango, but it looks very heavy.

Is it possible to render bold Cyrillic letters to PDF without it?

I also discovered that my program (which works under 32-bit Windows XP) doesn't work under 64-bit Windows 7: libHaru claims that it cannot open c:\windows\fonts\arial.ttf (the default font), although the file is in place. Why?

Actions #20

Updated by Michael Shestero over 9 years ago

Please help!

Months have passed, but my problem is still not solved!

Recently I had to rebuild everything on another machine.

I used the latest wt-3.3.3 and the latest libHaru release.

Plain Russian text goes into the PDF normally, but if I put it into or tags it gets corrupted and I see "WString: narrow(): loss of detail: ..." errors on the console! Again, it does not look like a font matter.

I also see that it sometimes cannot open the correct arial.ttf (default) font, especially if it is not in the current directory but specified by a full path.

libPango is too huge. I wonder why I should use such a complex thing to solve a simple task, when I don't even change the font name!

That's bad but not fatal: I can just cut and tags from my XHTML.

Worst of all, on the following HTML (see the attachment) your library goes into an endless loop, continually writing this error ("WString: narrow(): loss of detail: ...")!... :-(

Actions #21

Updated by Koen Deforche over 9 years ago

  • Status changed from Resolved to Closed