DynaPDF Manual - Page 486

Function Reference
%It would be returned in one GetPageText() call as one coherent kerning
%record.
(The fox eats the lazy mouse.)Tj
%This version emulates the spaces with kerning space.
%It would be returned in one GetPageText() call with 6 kerning records.
[(The)-280(fox)-280(eats)-280(the)-280(lazy)-280(mouse.)]TJ
%This version uses PDF positioning operators to emulate spaces.
%It produces 6 separate GetPageText() calls.
(The)Tj
2.8 0 Td
(fox)Tj
2.8 0 Td
(eats)Tj
2.8 0 Td
(the)Tj
2.8 0 Td
(lazy)Tj
2.8 0 Td
(mouse.)Tj
In the worst case, each text record consists of only one character. The text may also be stored
unsorted, or interleaved with other text that lies at a completely different position on the page.
There is not necessarily a logical connection between what you see on screen and what is stored in
the PDF file. Especially if a PDF file contains tables, the order of the text records is sometimes
very difficult to understand.
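The difference between the three variants above can be made visible with a small script. The following is a minimal sketch in Python, not DynaPDF code: the function text_records() and its regular expressions are illustrative assumptions, and the parser deliberately ignores escaped or nested parentheses, hex strings, and every operator other than Tj and TJ. It only counts how many text-showing operators a stream contains and how many string fragments each one carries.

```python
import re

# One match per text-showing operator: "[...]TJ" (array form) or "(...)Tj".
# Simplification: no escaped or nested parentheses and no hex strings.
SHOW_TEXT = re.compile(r"\[(?P<array>[^\]]*)\]\s*TJ|\((?P<simple>[^)]*)\)\s*Tj")
STRING_IN_ARRAY = re.compile(r"\(([^)]*)\)")


def text_records(stream):
    """Return one list of string fragments per text-showing operator."""
    # Drop comment lines so that text inside "%..." is never treated as content.
    body = "\n".join(
        line for line in stream.splitlines() if not line.lstrip().startswith("%")
    )
    records = []
    for match in SHOW_TEXT.finditer(body):
        if match.group("array") is not None:
            # The numbers between the strings are kerning values; skip them.
            records.append(STRING_IN_ARRAY.findall(match.group("array")))
        else:
            records.append([match.group("simple")])
    return records


plain = "(The fox eats the lazy mouse.)Tj"
kerned = "[(The)-280(fox)-280(eats)-280(the)-280(lazy)-280(mouse.)]TJ"
positioned = "\n2.8 0 Td\n".join(
    f"({word})Tj" for word in ["The", "fox", "eats", "the", "lazy", "mouse."]
)

for name, stream in [("plain", plain), ("kerned", kerned), ("positioned", positioned)]:
    records = text_records(stream)
    print(name, len(records), [len(r) for r in records])
```

For the three sample streams this reports one operator with one fragment, one operator with six fragments, and six operators with one fragment each. All three render the same sentence, which is why an extraction routine cannot rely on operator boundaries to find word or line boundaries.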
Possible encoding issues
If text must be extracted, deleted, or replaced, it is very important that the text in the PDF file
can be converted to Unicode. This conversion is possible if the font uses a standard encoding such as
WinAnsi or MacRoman, if it contains a ToUnicode CMap, if it contains PostScript character names
that are listed in the Adobe Glyph List, or if it uses a predefined external CMap and this CMap is
available in one of the CMap search paths (see SetCMapDir() for further information).
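The conditions listed above can be read as a simple checklist. The following Python sketch restates them as a predicate. It is an illustration under stated assumptions, not part of the DynaPDF API: the FontInfo class, its fields, and can_convert_to_unicode() are hypothetical names, and a real implementation would have to inspect the font resource itself.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Standard encodings named in the text, written with their PDF names.
STANDARD_ENCODINGS = {"WinAnsiEncoding", "MacRomanEncoding"}


@dataclass
class FontInfo:
    """Hypothetical summary of a font resource (not a DynaPDF structure)."""
    base_encoding: Optional[str] = None   # e.g. "WinAnsiEncoding"
    has_to_unicode_cmap: bool = False     # font contains a ToUnicode CMap
    glyph_names_in_agl: bool = False      # glyph names found in the Adobe Glyph List
    external_cmap: Optional[str] = None   # name of a predefined external CMap


def can_convert_to_unicode(font: FontInfo, available_cmaps: Set[str]) -> bool:
    """Return True if at least one of the four conditions is met."""
    if font.has_to_unicode_cmap:
        return True
    if font.base_encoding in STANDARD_ENCODINGS:
        return True
    if font.glyph_names_in_agl:
        return True
    # An external CMap only helps if it was found in a CMap search path.
    return font.external_cmap is not None and font.external_cmap in available_cmaps


print(can_convert_to_unicode(FontInfo(base_encoding="WinAnsiEncoding"), set()))
print(can_convert_to_unicode(FontInfo(external_cmap="90ms-RKSJ-H"), set()))
print(can_convert_to_unicode(FontInfo(external_cmap="90ms-RKSJ-H"), {"90ms-RKSJ-H"}))
```

The second call returns False because the external CMap is referenced but not available, which corresponds to the case in which no CMap search path has been set.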
More complicated is the processing of text in certain European languages such as Russian, Greek, or
Czech. A common technique for such text is to convert the original font to a symbol font in order to
avoid the usage of a CID font (multi-byte font), because the PDF format supports only four
predefined 8 bit encodings (WinAnsi, MacRoman, MacExpert, and Symbol). The advantage is that 8 bit
strings can be stored in the PDF file, which results in a smaller file size, and the PDF file remains
compatible with Acrobat versions prior to 4.0, because CID fonts are supported only since PDF 1.3.
The problem is that if the font resource contains neither a ToUnicode CMap nor PostScript character
names, it is no longer possible to convert the text to Unicode. Depending on how a PDF file was
created, the encoding is often not even known to the PDF driver, e.g. when converting PCL or AFP
files to PDF.
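Why the missing mapping is fatal can be shown with plain 8 bit code pages. The following Python sketch is only an analogy: the Windows code pages stand in for the private encoding of a symbol font, and the byte string is an assumed example. The same six bytes yield a Russian word under one mapping and meaningless characters under the others, and nothing in the bytes themselves tells which mapping is the right one.

```python
# Six bytes as they might be stored in an 8 bit PDF string (assumed example).
raw = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])

# Without a ToUnicode CMap or glyph names, each of these mappings is
# equally plausible from the point of view of an extraction routine.
decoded = {name: raw.decode(name) for name in ("cp1251", "cp1252", "cp1250")}

for name, text in decoded.items():
    print(name, text)
```

Only the cp1251 (Cyrillic) interpretation produces a readable word. A ToUnicode CMap or PostScript character names play the role of the code page name here: they are the only information that selects the correct interpretation.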
 
