DynaPDF Manual - Page 512

Function Reference
To process the strings of such fonts correctly, DynaPDF must be able to load the required CMap files
when necessary. DynaPDF is therefore delivered with the most important CMap files, which are provided
by Adobe Systems. These CMaps can be found in the DynaPDF installation directory at
/Resource/CMap/. Applications that extract text from PDF files should include these CMaps so
that they can be loaded at runtime.
The search path to external CMaps must be set with SetCMapDir() before GetPageText() is executed
for the first time. The function creates a CMap cache that is held in memory until the PDF instance
is deleted. The search path(s) to external CMap files should be set only once per PDF instance,
and one PDF instance should be used to process as many PDF files as possible. This can significantly
improve processing speed.
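The benefit of reusing one instance can be illustrated with a small simulation. This is only a
sketch: ExtractorInstance, extract(), and count_loads() are hypothetical stand-ins invented for this
illustration and are not part of the DynaPDF API. The sketch assumes that the cache belongs to the
instance and that each CMap is loaded at most once per instance, as described above.

```python
class ExtractorInstance:
    """Hypothetical stand-in for a PDF instance; NOT the DynaPDF API."""

    def __init__(self, cmap_dir):
        self.cmap_dir = cmap_dir  # search path, set once per instance
        self._cmap_cache = {}     # lives as long as the instance lives
        self.loads = 0            # counts simulated loads from disk

    def _load_cmap(self, name):
        if name not in self._cmap_cache:
            self.loads += 1       # simulated: read and parse the CMap file
            self._cmap_cache[name] = f"{self.cmap_dir}/{name}"
        return self._cmap_cache[name]

    def extract(self, required_cmaps):
        """Simulate processing one file that needs the given CMaps."""
        for name in required_cmaps:
            self._load_cmap(name)


def count_loads(files, reuse_instance):
    """Return how many CMap loads a batch of files causes."""
    shared = ExtractorInstance("/Resource/CMap")
    fresh_loads = 0
    for cmaps in files:
        if reuse_instance:
            shared.extract(cmaps)
        else:
            inst = ExtractorInstance("/Resource/CMap")  # new cache each time
            inst.extract(cmaps)
            fresh_loads += inst.loads
    return shared.loads if reuse_instance else fresh_loads


# Three files that need overlapping sets of CMaps.
batch = [["UniJIS-UCS2-H", "90ms-RKSJ-H"],
         ["UniJIS-UCS2-H"],
         ["90ms-RKSJ-H", "UniJIS-UCS2-H"]]
```

With one shared instance, each distinct CMap in the batch is loaded once. With a new instance per
file, every file pays the loading cost again.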
Order of Text records
GetPageText() returns each time a text showing operator is found. The returned text therefore does
not necessarily represent a text line. It can be anything from a single character to a complete text
line, depending on how the text is stored in the PDF file.
The order in which text is returned is essentially arbitrary. It depends on the file creator whether
text is stored in the logical reading order. For example, most PDF drivers convert headers and
footers first. Such strings then appear at the beginning of the content stream. The remaining strings
are not necessarily ordered either, and one text line can be stored in several different text objects.
A text search or text replacement algorithm must correctly handle cases in which a word or sentence
is split across different text objects. In the worst case, GetPageText() returns only a single
character per call. As long as the text is not rotated, it is relatively easy to determine whether
two text records lie at the same y-coordinate, but finding arbitrarily rotated text that is also
stored in several different text objects requires further math.
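For the non-rotated case, the grouping can be sketched as follows. The sketch assumes that each
text record has already been reduced to a tuple (x, y, text) in user space. The record layout, the
function names, and the tolerance value are assumptions made for this illustration and are not part
of DynaPDF. Fragments are joined without inserting spaces; detecting gaps between fragments would
additionally require the text width and the space width.

```python
def group_into_lines(records, tolerance=0.5):
    """Group (x, y, text) fragments that share a y-coordinate into lines.

    Assumes non-rotated text with positions already in user space.
    Returns a list of (y, line_text), ordered from top to bottom.
    """
    lines = []  # list of (y, [(x, text), ...])
    # Visit fragments from top to bottom (larger y is higher on the page).
    for x, y, text in sorted(records, key=lambda r: -r[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))   # same line as the previous one
        else:
            lines.append((y, [(x, text)]))   # start a new line
    # Sort each line from left to right and join its fragments.
    return [(y, "".join(t for _, t in sorted(frags, key=lambda f: f[0])))
            for y, frags in lines]


def find_in_lines(records, needle, tolerance=0.5):
    """Return the y-coordinate of every reconstructed line containing needle."""
    return [y for y, text in group_into_lines(records, tolerance)
            if needle in text]


# Fragments in content stream order: a footer first, then a split line.
records = [(300, 40, "Page 1"),
           (90, 700.2, "lo Wor"),
           (72, 700, "Hel"),
           (130, 700, "ld"),
           (72, 680, "Next line")]
```

Searching the reconstructed lines rather than the individual records lets a match such as
"lo World" succeed even though it spans two text records.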
The position of a text object is calculated from the two transformation matrices ctm and tm. The
global transformation matrix ctm represents the coordinate system that is current when a text
showing operator is found. The matrix ctm is already pre-multiplied, that is, it contains all
transformations applied so far, because GetPageText() does not return when a new transformation
matrix is applied.
The text transformation matrix tm represents the text coordinate system in which text properties
such as text width, font size, character spacing, word spacing, or the space width are calculated. All
text positioning operators are already included in this matrix.
The combination of both matrices represents the final user space in which the text is rendered. Both
matrices must be combined to enable the calculation of the text position and orientation (see the
examples on the following pages to determine how the matrices must be combined).
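As a rough illustration of the underlying math, the following sketch combines two matrices in the
order used by the PDF text model (tm multiplied by ctm, row-vector convention) and derives the text
origin and rotation angle from the result. The tuple layout (a, b, c, d, x, y) and all function
names are assumptions made for this sketch and are not DynaPDF functions. Skewed or mirrored
coordinate systems are not handled.

```python
import math


def multiply(m1, m2):
    """Return m1 x m2 for PDF matrices given as (a, b, c, d, x, y)."""
    a1, b1, c1, d1, x1, y1 = m1
    a2, b2, c2, d2, x2, y2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            x1 * a2 + y1 * c2 + x2, x1 * b2 + y1 * d2 + y2)


def text_origin_and_angle(ctm, tm):
    """Return ((x, y), angle in degrees) of a text record in user space."""
    a, b, _, _, x, y = multiply(tm, ctm)  # text space -> user space
    # The transformed x-axis (a, b) gives the direction of the baseline.
    return (x, y), math.degrees(math.atan2(b, a))


def baseline_offset(ctm, tm):
    """Signed distance of the baseline from the user space origin,
    measured perpendicular to the text direction."""
    (x, y), angle = text_origin_and_angle(ctm, tm)
    rad = math.radians(angle)
    return -x * math.sin(rad) + y * math.cos(rad)


# A page translated by (100, 200) with two fragments rotated by 90 degrees.
ctm = (1, 0, 0, 1, 100, 200)
tm_first = (0, 1, -1, 0, 10, 0)
tm_second = (0, 1, -1, 0, 10, 50)
```

Two text records whose angles and baseline offsets agree within a small tolerance lie on the same
line, even when the text is rotated. For non-rotated text the baseline offset reduces to the
y-coordinate.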
Organization of content streams and pages
A PDF page consists of a content stream and a resource array which contains the resources such as
fonts, images, and so on which can be used by the page. The content stream contains the PDF
operators which paint the contents of a PDF page.