<META content="Matthew T. Russotto" name=author>
<META content="MSHTML 6.00.2715.400" name=GENERATOR></HEAD>
<BODY>
<H1>Microsoft's HTML Help (.chm) format</H1>
<H2 id=preface>Preface</H2>
<P>This is documentation on the .chm format used by Microsoft HTML Help. This
format has been reverse engineered in the past, but as far as I know this is the
first freely available documentation on it. One Usenet message indicates that
these .chm files are actually IStorage files documented in the Microsoft
Platform SDK. However, I have not been able to locate such documentation. </P>
<H3>Note</H3>
<P>The word "section" is badly overloaded in this document. Sorry about
that.</P>
<P>All numbers are in hexadecimal unless otherwise indicated in the text. Except
in tabular listings, this will be indicated by $ or 0x as appropriate. All
values within the file are Intel byte order (little endian) unless indicated
otherwise. </P>
<H2 id=overview>The overall format of a .chm file</H2>
<P>The .chm file begins with a short ($38 byte) initial header. This is followed
by the header section table, the offset to the content, and a number of bytes of
information of unknown use. Collectively, this is the "header". </P>
<P>The header is followed by the header sections. There are two header sections.
One header section is the file directory, the other contains the file length and
some unknown data. Immediately following the header sections is the content.
</P>
<H2 id=header>The Header</H2>
<P>The header starts with the initial header, which has the following format
</P><PRE>0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and
following data.
000C: DWORD 1 (unknown)
0010: DWORD a timestamp.
0014: DWORD Windows Language ID. The two I've seen
$0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US
$0407 = LANG_GERMAN/SUBLANG_GERMAN
0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC}
0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC}
</PRE>
<P>Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.</P>
<P>It is followed by the header section table, which is 2 entries, where each
entry is $10 bytes long and has this format: </P><PRE>0000: QWORD Offset of section from beginning of file
0008: QWORD Length of section
</PRE>
<P>Following the header section table is 8 bytes of additional header data. In
Version 2 files, this data is not there and the content section starts
immediately after the directory. </P><PRE>0000: QWORD Offset within file of content section 0
</PRE>
<H2 id=headersections>The Header Sections</H2>
<H3>Header Section 0</H3>
<P>This section contains the total size of the file, and not much else </P><PRE>0000: DWORD $01FE (unknown)
0004: DWORD 0 (unknown)
0008: QWORD File Size
0010: DWORD 0 (unknown)
0014: DWORD 0 (unknown)
</PRE>
<H3>Header Section 1: The Directory Listing</H3>
<P>The central part of the .chm file: A directory of the files and information
it contains. </P>
<H4>Directory header</H4>
<P>The directory starts with a header; its format is as follows: </P><PRE>0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2.
0018: DWORD Depth of the index tree
1 there is no index, 2 if there is one level of PMGI
chunks.
001C: DWORD Chunk number of root index chunk, -1 if there is none
(though at least one file has 0 despite there being no
index chunk, probably a bug.)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
</PRE>
<H4>The Listing Chunks</H4>
<P>The header is directly followed by the directory chunks. There are two types
of directory chunks -- index chunks, and listing chunks. The index chunk will be
omitted if there is only one listing chunk. A listing chunk has the following
format: </P><PRE>0000: char[4] 'PMGL'
0004: DWORD Length of free space and/or quickref area at end of
directory chunk
0008: DWORD Always 0.
000C: DWORD Chunk number of previous listing chunk when reading
directory in sequence (-1 if this is the first listing chunk)
0010: DWORD Chunk number of next listing chunk when reading
directory in sequence (-1 if this is the last listing chunk)
0014: Directory listing entries (to quickref area) Sorted by
filename; the sort is case-insensitive.
</PRE>
<P>The quickref area is written backwards from the end of the chunk. One
quickref entry exists for every n entries in the file, where n is calculated as
1 + (1 << quickref density). So for density = 2, n = 5. </P><PRE>Chunklen-0002: WORD Number of entries in the chunk
Chunklen-0004: WORD Offset of entry n from entry 0
Chunklen-0008: WORD Offset of entry 2n from entry 0
Chunklen-000C: WORD Offset of entry 3n from entry 0
...
</PRE>
<P>The format of a directory listing entry is as follows </P><PRE> BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
</PRE>
<P>The offset is from the beginning of the content section the file is in, after
the section has been decompressed (if appropriate). The length also refers to
length of the file in the section after decompression. </P>
<P>There are two kinds of file represented in the directory: user data and
format related files. The files which are format-related have names which begin
with '::', the user data files have names which begin with "/". </P>
<H4>The Index Chunk</H4>
<P>An index chunk has the following format </P><PRE>0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
</PRE>
<P>The quickref area in an PMGI is the same as in an PMGL </P>
<P>The format of a directory index entry is as follows </P><PRE> BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
</PRE>
<P>When higher-level indexes exist (when the depth of the index tree is 3 or
higher), presumably the upper-level indexes will contain the numbers of
lower-level index chunks rather than listing chunks </P>
<H4>Encoded Integers</H4>
<P>An ENCINT is a variable-length integer. The high bit of each byte indicates
"continued to the next byte". Bytes are stored most significant to least
significant. So, for example, $EA $15 is (((0xEA&0x7F)<<7)|0x15) =
0x3515. </P>
<H2 id=content>The Content</H2>
<P>The content typically immediately follows the header sections, and is at the
location indicated by the DWORD following the header section table. All content
section 0 locations in the directory are relative to that point. The other
content sections are stored WITHIN content section 0. </P>
<H3>The Namelist file</H3>
<P>There exists in content section 0 and in the directory a file called
"::DataSpace/NameList". This file contains the names of all the content
sections. The format is as follows: </P><PRE>0000: WORD Length of file, in words
0002: WORD Number of entries in file
Each entry:
0000: WORD Length of name in words, excluding terminating null
0002: WORD Double-byte characters
xxxx: WORD 0
</PRE>
<P>Yes, the names have a length word
CHM文件的文件格式
4星 · 超过85%的资源 需积分: 9 138 浏览量
2009-01-09
02:47:19
上传
评论
收藏 5KB ZIP 举报
dream8346
- 粉丝: 4
- 资源: 15
最新资源
- 装修通用报价参考,基础施工项目+水电工程项目+瓦木项目,超级详细
- 三菱PLC例程源码Medocsequencegenerator
- 三菱PLC例程源码M1320磨头进出FX1s控制步进电机,有注释
- STRASSEN矩阵乘法算法(改进分治法·C语言)
- 前端.xmind前端.xmind前端.xmind前端.xmind前端.xmind
- 三菱PLC例程源码LOW-E玻璃镀膜线程序(三菱QPLC的)一万步带注释
- 三菱PLC例程源码LCD设备蚀刻机程序
- 三菱PLC例程源码LCD设备蚀刻机
- 全面前端开发指南:从基础到深入
- pvk2pfx 32位 Pvk2Pfx (Pvk2Pfx.exe) 是一种命令行工具,可将 .spc、.cer 和 .pvk 文
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈