Structure Analysis of the DOC File Format

The Doc format is the de facto standard for large text documents on the Palm Computing Platform. It enjoys wide support in both software and content, but documentation is sparse. This document is an attempt to describe the Doc format for the edification of programmers who are interested in writing Doc-compatible software, and to encourage programmers not to break the format in incompatible ways.
This document is totally unofficial, and derived from examination of existing Doc files and applications.
Overview
A Doc-format e-text is an ordinary PalmPilot database, represented on the desktop by a file in the standard .prc/.pdb format. (Describing that format is currently beyond the scope of this document.) The database is divided into three sections, which appear in order:
A header record
A series of text records
A series of bookmark records
Header Record
Note that all values are stored MSB first, as is usual on the PalmPilot.
record_size   2 bytes           maximum size of each record (usually 4096; see below)
position      4 bytes           currently viewed position in the document
sizes         2*records bytes   record size array
The position field is not used by all readers; some store this information elsewhere.
AportisDoc (Reader and Mobile Edition) set spare to 0x0003, and overwrite the first two bytes of length with zeros (even if the document is more than 64k bytes in length!) upon first opening the document.
The sizes array is a list of two-byte unsigned integers giving the uncompressed size of each text record, in order. It is created by some readers (AportisDoc, TealDoc, Doc, and possibly others) when the document is first opened.
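As a rough illustration of the header record, here is a minimal Python sketch of a parser. The field list above names only record_size, position and sizes, while the text also mentions version, spare, length and records. The sketch therefore assumes the 16-byte fixed layout commonly documented for Doc files: version (2 bytes), spare (2), length (4), records (2), record_size (2), position (4), followed by the optional sizes array. That layout, the DocHeader type and the parse_header function are assumptions for illustration, not something this document confirms.

```python
import struct
from typing import NamedTuple, Tuple


class DocHeader(NamedTuple):
    """Hypothetical container for the header fields (names follow the text)."""
    version: int
    spare: int
    length: int
    records: int
    record_size: int
    position: int
    sizes: Tuple[int, ...]


# ">" selects big-endian (MSB first) with no padding: 2+2+4+2+2+4 = 16 bytes.
# ASSUMPTION: this fixed layout is not spelled out in the field list above.
FIXED = struct.Struct(">HHIHHI")


def parse_header(record0: bytes) -> DocHeader:
    """Parse the first database record as a Doc header (sketch only)."""
    if len(record0) < FIXED.size:
        raise ValueError("header record is shorter than 16 bytes")
    version, spare, length, records, record_size, position = FIXED.unpack_from(record0, 0)
    tail = record0[FIXED.size:]
    # The sizes array is only present once a reader has created it,
    # so read as many two-byte entries as are actually there.
    count = min(records, len(tail) // 2)
    sizes = struct.unpack_from(">%dH" % count, tail, 0)
    return DocHeader(version, spare, length, records, record_size, position, sizes)
```

Because some readers rewrite spare and part of length after first opening a document, a tool that needs the true text length may be safer summing the decompressed record lengths than trusting the length field.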
Text Records
Following the header record is a series of text records, each one of which represents a text block no greater than record_size bytes in length. Most conversion software creates blocks of 4096 bytes (except for the last one); the format provides for other block sizes and for records of varying lengths, but it is likely that some Doc-handling software cannot deal with anything but fixed 4096-byte records.
In a version 1 database, each block of text is simply stored in a single record. In a version 2 database, each block of text is individually compressed, making the actual record size somewhat smaller -- note that the block size refers to the uncompressed size of a text block.
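To make the block structure concrete, the following Python sketch splits a text into fixed-size blocks the way most conversion software is described as doing. The function name split_blocks is hypothetical, and the sketch covers only the uncompressed (version 1) case; in a version 2 database each returned block would then be compressed individually.

```python
from typing import List


def split_blocks(text: bytes, record_size: int = 4096) -> List[bytes]:
    """Cut the text into blocks of record_size bytes; the last may be shorter."""
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    return [text[i:i + record_size] for i in range(0, len(text), record_size)]
```

The lengths of the returned blocks are the uncompressed sizes that a reader would store in the sizes array of the header record.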
Compression Algorithm
Note: The original designer of the Doc compression format, Pat Beirne, has reposted one of his original messages describing the algorithm. If you are curious about why it works the way it does, check it out.
Each text block (in a version 2 database) is individually compressed using a simple one-pass algorithm. As I am far from an expert in compression algorithm design, I shall simply describe what the data looks like and refer anyone interested in more details to the code (which is readily available in a variety of places, such as in the source to txt2pdbdoc or the source to Pyrite).
The output of the compression algorithm is a stream of bytes, described here with the action taken by the decompressor when they are encountered:
Compression Byte Codes
0x01-0x08   Copy the following N bytes verbatim (N is the value of the code byte)
0x09-0x7f   Pass through as-is
0x80-0xbf   Copy a sequence from a previous part of the block
0xc0-0xff   Insert a space followed by N xor 0x80
When a copy-sequence byte code is encountered, it is used as the high byte of a two byte quantity, along with the next byte in the data (resulting in a value from 0x8000-0xbfff). This value is then ANDed with 0x3fff, resulting in a value from 0x0000 to 0x3fff. It is further subdivided into an offset (the upper 11 bits, which are shifted down appropriately) and a length (the lower 3 bits). The actual data in the output is located by subtracting the offset from the current position in the decompressed data; the number of bytes copied is equal to the length plus 3.
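The decompressor described above can be sketched in a few lines of Python. This is an illustrative reading of the byte codes, not the code from txt2pdbdoc or Pyrite. The function name decompress_block is hypothetical. The sketch assumes that byte 0x00 passes through unchanged, that only 0x01-0x08 act as verbatim-run counts (so a tab, 0x09, is an ordinary literal), and that copied sequences may overlap the bytes being written.

```python
def decompress_block(data: bytes) -> bytes:
    """Decompress one version 2 text record (illustrative sketch)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        i += 1
        if 0x01 <= c <= 0x08:
            # Copy the following c bytes verbatim.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 (assumed) and 0x09-0x7f pass through as-is.
            out.append(c)
        elif c < 0xC0:
            # Copy-sequence: combine with the next byte, keep the low 14 bits.
            if i >= n:
                raise ValueError("truncated copy-sequence code")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            offset = pair >> 3           # upper 11 bits
            length = (pair & 0x07) + 3   # lower 3 bits, plus 3
            if offset == 0 or offset > len(out):
                raise ValueError("copy offset points outside the block")
            # Byte-by-byte copy so that overlapping sequences work.
            for _ in range(length):
                out.append(out[-offset])
        else:
            # 0xc0-0xff: a space followed by the code byte xor 0x80.
            out.append(0x20)
            out.append(c ^ 0x80)
    return bytes(out)
```

For instance, the two bytes 0x80 0x1b combine to 0x801b; masking gives 0x001b, which splits into offset 3 and length 3 + 3 = 6, so after the literals "abc" the decompressor emits "abcabc" again.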
Bookmark Records
Following the text records is an optional series of bookmark records. Each bookmark occupies a single record, and they are usually presented by the reader in the same order they appear in the database. The format of a bookmark record is rather simple:
name       16 bytes   bookmark name (up to 15 characters, null terminated)
position   4 bytes    bookmark position, from beginning of text
Note that the bookmark name field is always 16 bytes wide, even if the name is shorter, and that the position is in actual text bytes before compression.
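A bookmark record as described above is 20 bytes long and maps directly onto a fixed struct layout. The Python sketch below packs and unpacks one. The helper names and the choice of Latin-1 as the text encoding are assumptions made for illustration; the document does not say which character encoding bookmark names use.

```python
import struct
from typing import Tuple

# 16-byte name field followed by a 4-byte big-endian position: 20 bytes total.
BOOKMARK = struct.Struct(">16sI")


def pack_bookmark(name: str, position: int) -> bytes:
    """Build a bookmark record; the name is cut to 15 bytes plus a NUL."""
    raw = name.encode("latin-1")[:15]  # ASSUMPTION: single-byte encoding
    # struct pads a short "16s" value with NUL bytes up to the full width.
    return BOOKMARK.pack(raw, position)


def unpack_bookmark(record: bytes) -> Tuple[str, int]:
    """Return (name, position) from a bookmark record."""
    raw, position = BOOKMARK.unpack_from(record, 0)
    name = raw.split(b"\0", 1)[0].decode("latin-1")
    return name, position
```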
Common Conventions
Bookmark Autoscan
Because most Doc creation programs do not add bookmark records to their output, most Doc readers support an alternative method for authors to specify bookmark locations in a document. The reader scans the document the first time it is opened, looking for a specified string at the start of lines. Each time it is found, the reader adds a bookmark using the text on the rest of the line. By convention, the text to scan for is placed on the last line of the document, surrounded by angle brackets (< and >).
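The autoscan convention can be approximated with the following Python sketch. It is a guess at typical reader behaviour, not a description of any particular reader. The function name autoscan_bookmarks is hypothetical. The sketch assumes that the marker sits on the last non-blank line, that a bookmark's position is the byte offset of the start of its line, that names are trimmed and cut to 15 bytes, and that the text is in a single-byte encoding (Latin-1 here).

```python
from typing import List, Tuple


def autoscan_bookmarks(text: bytes) -> List[Tuple[str, int]]:
    """Find (name, position) pairs using the <marker> on the last line."""
    lines = text.splitlines(keepends=True)
    marker = None
    marker_index = -1
    # Look at the last non-blank line for a "<...>" marker.
    for idx in range(len(lines) - 1, -1, -1):
        stripped = lines[idx].strip()
        if stripped:
            if len(stripped) > 2 and stripped.startswith(b"<") and stripped.endswith(b">"):
                marker = stripped[1:-1]
                marker_index = idx
            break
    if marker is None:
        return []
    found = []
    offset = 0  # byte offset of the current line within the text
    for idx, line in enumerate(lines):
        if idx != marker_index and line.startswith(marker):
            name = line[len(marker):].strip()[:15]
            found.append((name.decode("latin-1"), offset))
        offset += len(line)
    return found
```

With the text "Intro", "* One", "text", "* Two", "more" and a final line "<*>", each on its own newline-terminated line, the sketch reports bookmarks "One" at offset 6 and "Two" at offset 17.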
TealDoc-Specific Extensions
The current TealDoc extensions are implemented by the use of HTML-like tags embedded in the text of the document. Although TealDoc tags look like HTML, TealDoc's parser is not as robust as that of a desktop web browser; the following limitations have been observed in practice:
Tags, attributes, and keyword values must be in all upper case.
Each tag must appear alone on a single line; attempting to embed a tag in the middle of a line of text will cause unpredictable results.
Text attribute values should be surrounded by double quotes; keyword and numeric values should not be quoted.
Other Extensions
Besides TealDoc, other Doc readers also extend the standard e-text database format. Some of these extensions will be more fully documented later; for the time being, this section contains a few notes in the hopes that future developers will be able to avoid compatibility problems. Please note that the notes in this section should not be considered authoritative or complete; if you are developing Doc software, you should investigate this stuff for yourself.
QED Extensions
QED, the Doc editor from Visionary 2000, adds an appinfo block, simultaneously marking the document with its version number (in the database header).
RichReader Extensions
RichReader, the rich text document reader by Michael Arena, supports formatting control codes (font changes, indentation, etc.) embedded in the document text. When viewed on another reader, RichReader documents may appear to contain "garbage" characters, since many of the formatting codes use non-printable or extended ASCII characters.
LinkDoc Extensions
Mobile LinkDoc, a reader from Mobile Generation Software, stores links between documents by adding extended bookmark records to the document being linked from.
Extensions Which Do Not Affect the Doc Format
A number of readers (nearly all of them, in fact) store additional information in databases separate from the documents themselves, leaving the documents unaltered. For example, category information is normally stored externally. These product-specific databases will not, at the present time, be documented here, because they do not affect the document format itself.
