DynaPDF Manual - Page 515

Function Reference
In the worst case each text record consists of only one character, and the text may also be stored
unsorted or interleaved with other text that lies at completely different positions on the page.
There is not necessarily a logical connection between what you see on screen and what is stored in
the PDF file. The order of text records is sometimes very difficult to understand, especially if a
PDF file contains tables.
Possible encoding issues
If text must be extracted, deleted, or replaced, it is very important that the text in the PDF file
can be converted to Unicode. This conversion is possible if the font uses a standard encoding such as
WinAnsi or MacRoman, if it contains a ToUnicode CMap, if it contains PostScript character names
that are listed in the Adobe Glyph List, or if it uses a predefined external CMap that is available
in one of the CMap search paths (see SetCMapDir() for further information).
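The conditions above can be summarized as a single predicate. The following is only a sketch: the structure FontEncodingInfo, its members, and the function CanConvertToUnicode() are hypothetical names chosen for illustration and are not part of the DynaPDF API.

```cpp
// Hypothetical summary of what is known about a font's encoding.
// None of these names come from DynaPDF; they only mirror the conditions in the text.
struct FontEncodingInfo
{
    bool usesStandardEncoding = false; // e.g. WinAnsi or MacRoman
    bool hasToUnicodeCMap     = false; // font resource contains a ToUnicode CMap
    bool hasAGLGlyphNames     = false; // PostScript character names listed in the Adobe Glyph List
    bool usesPredefinedCMap   = false; // font refers to a predefined external CMap
    bool cmapFileAvailable    = false; // that CMap was found in a CMap search path
};

// Returns true if at least one of the conditions for a Unicode conversion is met.
bool CanConvertToUnicode(const FontEncodingInfo &f)
{
    return f.usesStandardEncoding
        || f.hasToUnicodeCMap
        || f.hasAGLGlyphNames
        || (f.usesPredefinedCMap && f.cmapFileAvailable);
}
```

Note that a predefined external CMap alone is not sufficient in this sketch: the CMap file must also be found, which is why the last two members are combined with a logical AND.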
More complicated is the processing of text in languages such as Russian, Greek, Czech, and so on.
A common technique for such text is to convert the original font to a symbol font in order to avoid
a CID font (multi-byte font), because the PDF format supports only four predefined 8 bit encodings
(WinAnsi, MacRoman, MacExpert, and Symbol). The advantage is that 8 bit strings can be stored in
the PDF file, which results in a smaller file size, and the PDF file remains compatible with
Acrobat versions prior to 4.0, because CID fonts are supported only since PDF 1.3.
The problem is that if the font resource contains neither a ToUnicode CMap nor PostScript character
names, it is no longer possible to convert the text to Unicode. Depending on how a PDF file was
created, the encoding is often not even known to the PDF driver, e.g. when converting PCL or AFP
files to PDF. Such PDF files can be viewed and printed correctly, but it is not possible to extract
human-readable strings from them.
How to calculate the absolute string position?
The absolute string position can be calculated from the matrices ctm and tm. Before the string
position can be computed, the matrix tm must be transformed into user space. This is done by
multiplying the matrices ctm and tm into a new matrix:
   TCTM MulMatrix(TCTM &M1, TCTM &M2)
   {
      TCTM retval;
      retval.a = M2.a * M1.a + M2.b * M1.c;
      retval.b = M2.a * M1.b + M2.b * M1.d;
      retval.c = M2.c * M1.a + M2.d * M1.c;
      retval.d = M2.c * M1.b + M2.d * M1.d;
      retval.x = M2.x * M1.a + M2.y * M1.c + M1.x;
      retval.y = M2.x * M1.b + M2.y * M1.d + M1.y;
      return retval;
   }
The usage is as follows:
   TCTM m = MulMatrix(stack.ctm, stack.tm);
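To illustrate how a position can be read from the resulting matrix, the following sketch maps a point from text space into user space. It assumes that TCTM holds the six coefficients a, b, c, d, x, y as doubles and that the string starts at the origin (0, 0) of text space. The structure Point and the functions Transform() and StringOrigin() are hypothetical helpers introduced here for illustration only.

```cpp
// Assumed layout of a transformation matrix: six coefficients as used in MulMatrix().
struct TCTM { double a, b, c, d, x, y; };

// Hypothetical helper type, not part of the original interface.
struct Point { double x, y; };

// Same multiplication as MulMatrix() above, taking const references.
// The result applies M2 first and M1 second.
TCTM MulMatrix(const TCTM &M1, const TCTM &M2)
{
    TCTM retval;
    retval.a = M2.a * M1.a + M2.b * M1.c;
    retval.b = M2.a * M1.b + M2.b * M1.d;
    retval.c = M2.c * M1.a + M2.d * M1.c;
    retval.d = M2.c * M1.b + M2.d * M1.d;
    retval.x = M2.x * M1.a + M2.y * M1.c + M1.x;
    retval.y = M2.x * M1.b + M2.y * M1.d + M1.y;
    return retval;
}

// Maps the point (px, py) through the matrix M.
Point Transform(const TCTM &M, double px, double py)
{
    return Point{ M.a * px + M.c * py + M.x,
                  M.b * px + M.d * py + M.y };
}

// Assumption: the string origin is (0, 0) in text space, so its position
// in user space is simply the translation part (x, y) of the combined matrix.
Point StringOrigin(const TCTM &ctm, const TCTM &tm)
{
    TCTM m = MulMatrix(ctm, tm);
    return Transform(m, 0.0, 0.0);
}
```

With a ctm that scales by 2 and translates by (10, 20), and a tm that translates by (5, 7), StringOrigin() yields the point (20, 34): the text offset is scaled first and the translation of ctm is added afterwards.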