
Portable Document Format (PDF) is a file format created for the document exchange. Each PDF file encapsulates a complete description of a fixed-layout 2D document (and, with Acrobat 3D, embedded 3D documents) that includes the text, fonts, images, and 2D vector graphics which compose the documents.
PDF implements documents as a hierarchy of tagged objects organized into trees and/or linked lists. The objects can encapsulate various types of content, or attributes, or pointers to external resources.
There are eight basic kinds of objects in PDF: Booleans, numbers, names, strings, arrays, dictionaries, streams and the null object. Let’s take a look at some of these objects we need to work with.
Strings
In PDF a string consists of a series of 8-bit bytes surrounded by parentheses. A string can be divided into several lines by using the backslash (\) at the end of the line. The backslash itself is not considered as part of the string. For example:
( This is a string. )
( This is a longer \
string. )
Any 8-bit value can be represented either by its octal equivalent (in the form \ddd, where ddd is the octal number), or by its two-digit hex equivalent, surrounded by angle brackets. Later we will search for the text data in the strings.
|
Post new comment