There are two types of HTML files - structured documents using headings (H1, H2, etc.) which HTMLDOC calls "books", and unstructured documents that do not use headings which HTMLDOC calls "web pages".
A very common mistake is to try converting a web page using:
htmldoc -f filename.pdf filename.html
which will likely produce a PDF file with no pages. To convert web page files you must use the --webpage option at the command-line, for example htmldoc --webpage -f filename.pdf filename.html, or choose Web Page in the input tab of the GUI.
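As an illustration of the distinction, the following is a minimal sketch of an unstructured document. The title and text are hypothetical; the point is only that it contains no H1, H2, etc. headings, so in HTMLDOC's terms it is a "web page" rather than a "book".

```html
<HTML>
<HEAD>
  <TITLE>Release Notes</TITLE>
</HEAD>
<BODY>
<!-- Note: no H1, H2, ... headings appear anywhere in this file -->
<P><B>Release Notes</B></P>
<P>This page uses bold text in place of headings.</P>
<UL>
  <LI>Fixed a crash on startup</LI>
  <LI>Improved rendering speed</LI>
</UL>
</BODY>
</HTML>
```

A file like this would need the --webpage option (or the Web Page choice in the GUI) to convert as described above.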
HTMLDOC does not support HTML 4.0 elements, attributes, stylesheets, or scripting.
The following HTML elements are recognized by HTMLDOC:
Element | Version | Supported? | Notes |
---|---|---|---|
!DOCTYPE | 3.0 | Yes | DTD is ignored |
A | 1.0 | Yes | See Below |
ACRONYM | 2.0 | Yes | No font change |
ADDRESS | 2.0 | Yes | |
AREA | 2.0 | No | |
B | 1.0 | Yes | |
BASE | 2.0 | No | |
BASEFONT | 1.0 | No | |
BIG | 2.0 | Yes | |
BLINK | 2.0 | No | |
BLOCKQUOTE | 2.0 | Yes | |
BODY | 1.0 | Yes | |
BR | 2.0 | Yes | |
CAPTION | 2.0 | Yes | See Below |
CENTER | 2.0 | Yes | |
CITE | 2.0 | Yes | Italic/Oblique |
CODE | 2.0 | Yes | Courier |
DD | 2.0 | Yes | |
DEL | 2.0 | Yes | Strikethrough |
DFN | 2.0 | Yes | Helvetica |
DIR | 2.0 | Yes | |
DIV | 3.2 | Yes | |
DL | 2.0 | Yes | |
DT | 2.0 | Yes | Italic/Oblique |
EM | 2.0 | Yes | Italic/Oblique |
EMBED | 2.0 | Yes | HTML Only |
FONT | 2.0 | Yes | See Below |
FORM | 2.0 | No | |
FRAME | 3.2 | No | |
FRAMESET | 3.2 | No | |
H1 | 1.0 | Yes | Boldface, See Below |
H2 | 1.0 | Yes | Boldface, See Below |
H3 | 1.0 | Yes | Boldface, See Below |
H4 | 1.0 | Yes | Boldface, See Below |
H5 | 1.0 | Yes | Boldface, See Below |
H6 | 1.0 | Yes | Boldface, See Below |
HEAD | 1.0 | Yes | |
HR | 1.0 | Yes | See Below |
HTML | 1.0 | Yes | |
I | 1.0 | Yes | |
IMG | 1.0 | Yes | See Below |
INPUT | 2.0 | No | |
INS | 2.0 | Yes | Underline |
ISINDEX | 2.0 | No | |
KBD | 2.0 | Yes | Courier Bold |
LI | 2.0 | Yes | |
LINK | 2.0 | No | |
MAP | 2.0 | No | |
MENU | 2.0 | Yes | |
META | 2.0 | Yes | See Below |
MULTICOL | N3.0 | No | |
NOBR | 1.0 | No | |
NOFRAMES | 3.2 | No | |
OL | 2.0 | Yes | |
OPTION | 2.0 | No | |
P | 1.0 | Yes | |
PRE | 1.0 | Yes | |
S | 2.0 | Yes | Strikethrough |
SAMP | 2.0 | Yes | Courier |
SCRIPT | 2.0 | No | |
SELECT | 2.0 | No | |
SMALL | 2.0 | Yes | |
SPACER | N3.0 | Yes | |
STRIKE | 2.0 | Yes | |
STRONG | 2.0 | Yes | Boldface Italic/Oblique |
SUB | 2.0 | Yes | Reduced Fontsize |
SUP | 2.0 | Yes | Reduced Fontsize |
TABLE | 2.0 | Yes | See Below |
TD | 2.0 | Yes | |
TEXTAREA | 2.0 | No | |
TH | 2.0 | Yes | Boldface Center |
TITLE | 2.0 | Yes | |
TR | 2.0 | Yes | |
TT | 2.0 | Yes | Courier |
U | 1.0 | Yes | |
UL | 2.0 | Yes | |
VAR | 2.0 | Yes | Helvetica Oblique |
WBR | 1.0 | No | |
HTMLDOC supports many special HTML comments to initiate page breaks, set the header and footer text, and control the current media options:
<!-- FOOTER LEFT "foo" -->
<!-- FOOTER CENTER "foo" -->
<!-- FOOTER RIGHT "foo" -->
<!-- HALF PAGE -->
<!-- HEADER LEFT "foo" -->
<!-- HEADER CENTER "foo" -->
<!-- HEADER RIGHT "foo" -->
<!-- MEDIA BOTTOM nnn -->
<!-- MEDIA COLOR "foo" -->
<!-- MEDIA DUPLEX NO -->
<!-- MEDIA DUPLEX YES -->
<!-- MEDIA LANDSCAPE NO -->
<!-- MEDIA LANDSCAPE YES -->
<!-- MEDIA LEFT nnn -->
<!-- MEDIA POSITION nnn -->
<!-- MEDIA RIGHT nnn -->
<!-- MEDIA SIZE foo -->
<!-- MEDIA TOP nnn -->
<!-- MEDIA TYPE "foo" -->
<!-- NEED length -->
Breaks the page if fewer than length units are left on the current page. The length value defaults to lines of text but can be suffixed by in, mm, or cm to convert from the corresponding units.
<!-- NEW PAGE -->
<!-- NEW SHEET -->
<!-- NUMBER-UP nn -->
<!-- PAGE BREAK -->
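The comments above can be placed directly in the body of a document. The sketch below uses only comments from the list; the headings and text are hypothetical, and it assumes that each comment takes effect at the point where it appears and that the names mean what they suggest (MEDIA LANDSCAPE switches orientation, NEW PAGE starts a new page).

```html
<!-- Note: illustrative only; the effect of each comment is assumed from its name -->
<H1>Quarterly Report</H1>
<P>Introductory text on a portrait page.</P>

<!-- MEDIA LANDSCAPE YES -->
<!-- NEW PAGE -->
<H2>Wide Table</H2>
<P>Content that benefits from a landscape page.</P>

<!-- MEDIA LANDSCAPE NO -->
<!-- NEW PAGE -->
<H2>Summary</H2>
<!-- NEED 3in -->
<P>A block that should only start here if 3 inches remain on the page.</P>
```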
The HEADER and FOOTER comments allow you to set an arbitrary string of text for the left, center, and right headers and footers. Each string consists of plain text; special values or strings can be inserted using the dollar sign ($):
$$
$CHAPTER
$CHAPTERPAGE
$CHAPTERPAGE(format)
$CHAPTERPAGES
$CHAPTERPAGES(format)
$DATE
$HEADING
$LOGOIMAGE
$PAGE
$PAGE(format)
$PAGES
$PAGES(format)
$TIME
$TITLE
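A short sketch of how these values might be combined with plain text in HEADER and FOOTER comments. The meanings are assumed from the names and are not spelled out in the list above: $TITLE as the document title, $CHAPTER as the current chapter heading, $PAGE and $PAGES as the current and total page numbers, and $$ as a literal dollar sign.

```html
<!-- Note: meanings of the $ values are assumed from their names -->
<!-- HEADER LEFT "$TITLE" -->
<!-- HEADER RIGHT "$CHAPTER" -->
<!-- FOOTER LEFT "$DATE" -->
<!-- FOOTER CENTER "Page $PAGE of $PAGES" -->
<!-- FOOTER RIGHT "Cost: $$10" -->
```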
Limited typeface specification is currently supported to ensure portability across platforms and for older PostScript printers:
Requested Font | Actual Font |
---|---|
Arial | Helvetica |
Courier | Courier |
Helvetica | Helvetica |
Monospace | Courier |
Sans-Serif | Helvetica |
Serif | Times |
Symbol | Symbol |
Times | Times |
All other unrecognized typefaces are silently ignored.
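The mapping can be illustrated with a few paragraphs. This sketch assumes the typeface is requested through the FACE attribute of the FONT element; Garamond is a hypothetical example of a typeface that is not in the table.

```html
<!-- Assumption: typefaces are requested with the FACE attribute of FONT -->
<P><FONT FACE="Arial">Requested as Arial, set in Helvetica.</FONT></P>
<P><FONT FACE="Monospace">Requested as Monospace, set in Courier.</FONT></P>
<P><FONT FACE="Garamond">Garamond is not in the table, so the request is ignored.</FONT></P>
```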
Currently HTMLDOC supports a maximum of 1000 chapters (H1 headings). This limit can be increased by changing the MAX_CHAPTERS constant in the config.h file included with the source code.
All chapters start with a top-level heading (H1) markup. Any headings within a chapter must be of a lower level (H2 to H15). Each chapter starts a new page or the next odd-numbered page if duplexing is selected.
Note:
Heading levels 7 to 15 are not standard HTML and are unlikely to be recognized by most web browsers.
The headings you use within a chapter must start at level 2 (H2). If you skip levels the heading will be shown under the last level that was known. For example, if you use the following hierarchy of headings:
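<H1>Chapter Heading</H1>
...
<H2>Section Heading 1</H2>
...
<H2>Section Heading 2</H2>
...
<H3>Sub-Section Heading 1</H3>
...
<H4>Sub-Sub-Section Heading 1</H4>
...
<H4>Sub-Sub-Section Heading 2</H4>
...
<H3>Sub-Section Heading 2</H3>
...
<H2>Section Heading 3</H2>
...
<H4>Sub-Sub-Section Heading 3</H4>
...
the table-of-contents that is generated will show:
Chapter Heading
  Section Heading 1
  Section Heading 2
    Sub-Section Heading 1
      Sub-Sub-Section Heading 1
      Sub-Sub-Section Heading 2
    Sub-Section Heading 2
  Section Heading 3
    Sub-Sub-Section Heading 3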
VALUE="#"
TYPE="1"
TYPE="a"
TYPE="A"
TYPE="i"
TYPE="I"
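The VALUE and TYPE attributes listed above match the standard HTML numbering attributes for ordered lists. The following is a small sketch under that assumption: TYPE on OL selects the numbering style and VALUE on LI sets the number of an item. The placement of the attributes is not stated in the list itself.

```html
<!-- Assumption: TYPE and VALUE behave as in standard HTML ordered lists -->
<OL TYPE="a">
  <LI>First item, labelled a</LI>
  <LI VALUE="5">Item renumbered to 5, labelled e</LI>
  <LI>Next item, labelled f</LI>
</OL>
```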
External URL and internal (#target and filename.html) links are fully supported for HTML and PDF output.
When generating PDF files, local PDF file links will be converted to external file links for the PDF viewer instead of URL links. That is, you can directly link to another local PDF file from your HTML document with:
<A HREF="filename.pdf">...</A>
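The link forms described above can be sketched together in one fragment. The file names and URL are hypothetical, and the named anchor (A NAME) is the standard HTML way to define a #target.

```html
<!-- Note: usage.html, reference.pdf, and the URL are hypothetical -->
<H2><A NAME="install">Installation</A></H2>
<P>See <A HREF="#install">Installation</A> in this file,
<A HREF="usage.html">Usage</A> in another HTML file,
<A HREF="https://www.example.com/">the project site</A>, or
<A HREF="reference.pdf">the reference manual</A>, a local PDF file.</P>
```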
HTMLDOC supports the following META
attributes for the title page and document information:
<META NAME="AUTHOR" CONTENT="...">
<META NAME="COPYRIGHT" CONTENT="...">
<META NAME="DOCNUMBER" CONTENT="...">
<META NAME="GENERATOR" CONTENT="...">
<META NAME="KEYWORDS" CONTENT="...">
<META NAME="SUBJECT" CONTENT="...">
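As a sketch of where these elements go, META belongs in the HEAD of the document alongside TITLE. All CONTENT values below are hypothetical; only the NAME keywords come from the list above.

```html
<HTML>
<HEAD>
  <TITLE>Widget User Guide</TITLE>
  <!-- Hypothetical values; only the NAME keywords come from the manual -->
  <META NAME="AUTHOR" CONTENT="Jane Doe">
  <META NAME="COPYRIGHT" CONTENT="Copyright 2024 Example Corp.">
  <META NAME="DOCNUMBER" CONTENT="EX-0001">
  <META NAME="KEYWORDS" CONTENT="widget, manual">
</HEAD>
<BODY>
<H1>Introduction</H1>
<P>...</P>
</BODY>
</HTML>
```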
The BREAK attribute is still supported by the HR element:
<HR BREAK>
Support for the BREAK attribute is deprecated and will be removed in a future release of HTMLDOC.
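Since the attribute is deprecated, new documents may prefer one of the page-break comments listed earlier. The sketch below assumes the NEW PAGE comment is a suitable replacement; the text does not name a specific substitute.

```html
<!-- Note: deprecated form -->
<HR BREAK>

<!-- Assumption: a page-break comment such as NEW PAGE serves the same purpose -->
<!-- NEW PAGE -->
```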
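HTMLDOC limits the number of columns in a single table; this limit can be increased by changing the MAX_COLUMNS constant in the config.h file included with the source code.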
HTMLDOC supports HTML 3.0 tables with the following exceptions:
The CAPTION element is always shown at the top of the table.
HTMLDOC does not support HTML 4.0 table elements or attributes, such as TBODY, THEAD, TFOOT, or RULES.
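A minimal table that stays within these limits might look like the following sketch. The caption and cell data are hypothetical. Rows sit directly inside TABLE, with no THEAD, TBODY, or TFOOT wrappers.

```html
<!-- Note: rows are placed directly in TABLE; no THEAD, TBODY, TFOOT, or RULES -->
<TABLE BORDER="1">
  <CAPTION>Paper sizes</CAPTION>
  <TR><TH>Name</TH><TH>Width</TH><TH>Height</TH></TR>
  <TR><TD>Letter</TD><TD>8.5in</TD><TD>11in</TD></TR>
  <TR><TD>A4</TD><TD>210mm</TD><TD>297mm</TD></TR>
</TABLE>
```

Per the exception above, the caption would be shown at the top of the table.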