DynaPDF Manual - Page 641

Function Reference
fntTranslateRawString2() in C/C++). For this kind of algorithm, the TShowTextArrayW
callback function should be used because it provides everything required to develop fast text
extraction algorithms. The example projects text_extraction and text_coordinates
demonstrate how text extraction algorithms can be developed.
• Text search algorithms could use the TShowTextArrayW callback function too, but the usage
is much more complicated if strings of CID fonts must be processed. CID fonts support
encodings with variable code lengths of one to four bytes per character. Because
the string width cannot be computed from the translated Unicode string, the function must
be able to find the corresponding position in the source string. This is not easy, especially if
the search text was stored in multiple text records.
To simplify the development of text search algorithms, the content parser provides the
TShowTextArrayA callback function, which returns the raw source strings. The conversion to
Unicode can be done with TranslateRawCode() (the name is fntTranslateRawCode() in
C/C++). The function converts a sequence of source bytes to Unicode and calculates the
width of the resulting character. The advantage is that the exact position of every character
in a string can be calculated easily, independent of the current font type. The overhead of
calling the function once per character is small because the function is heavily optimized for
processing speed. The example text_search demonstrates how a text search algorithm can be
developed.
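The per-character approach described above can be illustrated with a small, self-contained sketch. It does not use the DynaPDF API: decodeOne() is a hypothetical toy stand-in for a call such as fntTranslateRawCode(), and the Glyph and Match types, the code layout (bytes below 0x80 are one-byte codes, everything else starts a two-byte code), and the widths are all assumptions made for illustration only. The point is that decoding code by code yields the start position of every character, so a match in the Unicode text maps directly to a position in the raw string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result of decoding one character code. This is NOT the
// signature of fntTranslateRawCode(); it only mimics the idea.
struct Glyph {
    char32_t    unicode; // decoded character
    double      width;   // advance width (arbitrary units)
    std::size_t bytes;   // number of source bytes consumed
};

// Toy decoder (assumption): bytes below 0x80 are one-byte codes of width 0.5,
// any other byte starts a two-byte code of width 1.0.
Glyph decodeOne(const unsigned char* src, std::size_t len) {
    assert(len > 0);
    if (src[0] < 0x80 || len < 2)
        return {static_cast<char32_t>(src[0]), 0.5, 1};
    return {static_cast<char32_t>((src[0] << 8) | src[1]), 1.0, 2};
}

struct Match {
    bool   found;
    double x;     // offset of the first matched character from the string origin
    double width; // total width of the matched text
};

// Decodes the raw string once, records where every character starts,
// then searches the decoded characters for the needle.
Match findInRawString(const std::string& raw, const std::u32string& needle) {
    std::vector<Glyph>  glyphs;
    std::vector<double> starts;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(raw.data());
    double x = 0.0;
    for (std::size_t i = 0; i < raw.size();) {
        Glyph g = decodeOne(p + i, raw.size() - i);
        glyphs.push_back(g);
        starts.push_back(x);
        x += g.width;
        i += g.bytes;
    }
    if (needle.empty() || needle.size() > glyphs.size())
        return {false, 0.0, 0.0};
    for (std::size_t s = 0; s + needle.size() <= glyphs.size(); ++s) {
        std::size_t k = 0;
        double w = 0.0;
        while (k < needle.size() && glyphs[s + k].unicode == needle[k]) {
            w += glyphs[s + k].width;
            ++k;
        }
        if (k == needle.size())
            return {true, starts[s], w};
    }
    return {false, 0.0, 0.0};
}
```

A real implementation would additionally have to apply the text matrix, character and word spacing, and handle search text that spans several text records; none of that is modeled here.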
Marked Content Sequences and Layers
The callback functions BeginMarkedContent and EndMarkedContent are called for marked content
sequences including layers. The older callback functions BeginLayer / EndLayer are executed only if
BeginMarkedContent / EndMarkedContent are not set.
Note also that BeginMarkedContent / EndMarkedContent or BeginLayer / EndLayer are executed
as usual, but no drawing operator is executed while the current visibility state is invisible. Images
and templates (Form XObjects in PDF terms) are likewise ignored while the visibility state is invisible.
In addition, path operators and the corresponding callback functions are still executed while the
visibility state is invisible, but an operator that draws a path is not executed.
This is because only the terminating operator specifies how the path must be processed. Since clipping
paths are pushed to the graphics state as usual, path operators cannot be ignored.
The visibility state does not affect the graphics state at all. Every operator that changes the graphics
state must be considered.
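The rules above can be summarized in a toy model. The following sketch is not DynaPDF code: the Op enumeration and the Consumer structure are hypothetical names, and a clipping operator together with its path-terminating operator is modeled as a single ClipPath step. It only shows the bookkeeping implied by the text, assuming that hidden marked content sequences can be nested: path construction, clipping, and graphics state changes are always tracked, while painting is skipped during an invisible sequence.

```cpp
#include <cassert>
#include <vector>

// Hypothetical operator kinds; they do not correspond to DynaPDF callbacks.
enum class Op { BeginVisible, BeginHidden, EndMarked,
                PathSegment, ClipPath, FillPath, SetLineWidth };

struct Consumer {
    std::vector<bool> hiddenStack;  // one entry per open marked content sequence
    int    hiddenDepth     = 0;     // number of open invisible sequences
    int    pendingSegments = 0;     // segments of the path under construction
    int    clipPaths       = 0;     // clipping paths pushed to the graphics state
    int    paintedPaths    = 0;     // paths that were actually drawn
    double lineWidth       = 1.0;   // example of a graphics state value

    bool visible() const { return hiddenDepth == 0; }

    void handle(Op op, double value = 0.0) {
        switch (op) {
        case Op::BeginVisible: hiddenStack.push_back(false); break;
        case Op::BeginHidden:  hiddenStack.push_back(true); ++hiddenDepth; break;
        case Op::EndMarked:
            assert(!hiddenStack.empty());
            if (hiddenStack.back()) --hiddenDepth;
            hiddenStack.pop_back();
            break;
        case Op::PathSegment:  ++pendingSegments; break;   // always tracked
        case Op::ClipPath:     ++clipPaths; pendingSegments = 0; break; // always applied
        case Op::FillPath:                                  // painting only if visible
            if (visible()) ++paintedPaths;
            pendingSegments = 0;
            break;
        case Op::SetLineWidth: lineWidth = value; break;    // state change always applied
        }
    }
};
```

Feeding a hidden sequence that contains a fill, a clip, and a line width change into this model leaves paintedPaths unchanged but still updates clipPaths and lineWidth, which mirrors the statement that the visibility state does not affect the graphics state.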
Using the Content Parser
The content parser can be used to extract text, vector graphics, and images from a PDF file. The
following sections describe which callback functions must be set, what must be stored in the graphics
state, and other important aspects.