This page describes, in Unix manual page style, a C++ program
available for downloading from this site
which allows you to translate human-readable electronic
texts into a variety of forms for electronic publication.
NAME
etset - translate ISO 8859 Etext to LaTeX, HTML, PML, or ASCII
SYNOPSIS
etset [ options ] [ infile [ outfile ] ]
DESCRIPTION
etset
allows human readable electronic books, prepared according to
the document preparation guidelines given below, to be
automatically typeset using LaTeX, translated into a collection
of World-Wide Web HTML documents, output in Palm Markup
Language to create documents for handheld platforms, or (if
written using ISO 8859 Latin 1 characters), “flattened” into
a 7-bit ASCII document by removing accents from letters and
translating other ISO characters into reasonable ASCII
representations. In addition, various tools are available
to assist editors in preparing electronic texts compatible with
the format used by this program.
OPTIONS
- --ascii-only
- Check for the presence of any characters not part of the
7-bit ASCII set (for example, accented letters belonging to the
ISO 8859-1 set), and generate warning messages identifying
them.
- --babel lang
- Use the LaTeX babel package for language lang.
- --check
- Check text for publication. Report any invalid characters
or formatting errors to standard error.
- --clean
- Clean up text for publication: expand tab characters to
spaces, remove trailing blanks from lines.
- --copyright
- Print copying information.
- --debug-parser file
- Write parser debugging information to file.
Each line in the body of the text is labeled with the
identification assigned to it by the parser.
- --flatten-iso
- ISO 8859-1 8-bit characters are replaced with
their closest 7-bit ASCII equivalent (for example, accented letters
are changed to unaccented characters). This is a destructive
transformation, and should be performed only when a text must be
displayed on a device which cannot accept 8-bit characters.
- --french-punctuation
- Insert nonbreaking spaces around punctuation
as normally done when typesetting French. Guillemets, colons, semicolons,
question marks, and exclamation points are set off from the adjoining
text by a space. This mode is unnecessary when typesetting
French with the --babel francais option.
- --help, -u
- Print how-to-call information including a
list of options.
- --html, -h
- Generate HTML output. By default, a document tree
is generated with an index document which links to individual
chapter documents, each of which contains navigation links. If the
--single-file option is specified, a single HTML document
containing the entire text is generated. HTML files are written to
the current directory.
- --latex, -l
- Generate a LaTeX file to typeset the document. If
the document is in a language other than English, you may also wish
to use the --babel option to invoke formatting appropriate
for the language.
- --match-quotes
- Report possible mismatched double quotes. (Note that multi-paragraph
continued quotes with a quote at the start and no closing quote will
be reported as mismatched by this option.) Quote matching is performed
only when generating an output format which translates double quotes
into opening and closing quotes (--latex, --palm, or
--html with --unicode specified).
- --modest
- Suppress the “Translated by” message in output files.
This is primarily useful when making and checking regression test
output to avoid discrepancies due only to the version number, date, and
time in these messages.
- --palm, -p
- Generate a file in Palm Markup Language to create a document
for Palm Reader on handheld platforms.
- --save-epilogue file
- The document epilogue is written to the
designated file.
- --save-prologue file
- The document prologue is written to the
designated file.
- --single-file
- Generate a single HTML file containing all chapters,
as opposed to the default of a document tree with a separate file for
each chapter.
- --special-strip
- Remove all format-specific special commands from
the document, and blank lines following a special command if they would
result in consecutive blank lines in the document. This option may
be used in conjunction with the --clean option when preparing
a text for publication in "Plain ASCII" format.
- --strict
- Generate HTML compatible with the XHTML 1.0 Strict Document
Type Definition. Note that this will make extensive use of Cascading
Style Sheets (CSS) in the document, which may cause compatibility problems
with older browsers which do not support or incorrectly implement
style specifications.
- --unicode
- Generate XHTML Unicode text entities for characters
(for example, opening and closing quotation marks and dashes)
not present in the ISO-8859 Latin-1 character set.
- --verbose, -v
- Print information regarding processing of the
document, including the number of lines read and written.
- --version
- Print program version information.
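As a rough illustration of what the --ascii-only and --flatten-iso
options described above involve, the following C++ sketch checks a line
for 8-bit characters and replaces a handful of ISO 8859-1 characters
with ASCII stand-ins. The names isSevenBit, flattenChar, and flattenIso,
the partial translation table, and the '?' fallback are assumptions made
for this example; they are not taken from the etset source, whose actual
table may well differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if every byte of the line is 7-bit ASCII.
bool isSevenBit(const std::string &line) {
    for (unsigned char c : line) {
        if (c > 0x7F) {
            return false;
        }
    }
    return true;
}

// Hypothetical, deliberately partial ISO 8859-1 to ASCII table.
// Any 8-bit character not listed here is replaced by '?'.
std::string flattenChar(unsigned char c) {
    if (c < 0x80) return std::string(1, static_cast<char>(c));
    if (c == 0xA0) return " ";     // no-break space
    if (c == 0xAB) return "<<";    // left guillemet
    if (c == 0xBB) return ">>";    // right guillemet
    if (c == 0xC6) return "AE";    // AE ligature
    if (c == 0xE6) return "ae";    // ae ligature
    if (c == 0xDF) return "ss";    // sharp s
    if (c == 0xC7) return "C";     // C cedilla
    if (c == 0xE7) return "c";     // c cedilla
    if (c == 0xD1) return "N";     // N tilde
    if (c == 0xF1) return "n";     // n tilde
    if (c >= 0xC0 && c <= 0xC5) return "A";
    if (c >= 0xC8 && c <= 0xCB) return "E";
    if (c >= 0xCC && c <= 0xCF) return "I";
    if (c >= 0xD2 && c <= 0xD6) return "O";
    if (c >= 0xD9 && c <= 0xDC) return "U";
    if (c >= 0xE0 && c <= 0xE5) return "a";
    if (c >= 0xE8 && c <= 0xEB) return "e";
    if (c >= 0xEC && c <= 0xEF) return "i";
    if (c >= 0xF2 && c <= 0xF6) return "o";
    if (c >= 0xF9 && c <= 0xFC) return "u";
    return "?";
}

// Flatten a whole line; the result contains only 7-bit characters.
std::string flattenIso(const std::string &line) {
    std::string out;
    for (unsigned char c : line) {
        out += flattenChar(c);
    }
    assert(isSevenBit(out));
    return out;
}
```

Because several accented letters map to the same ASCII letter, the
original text cannot be recovered from the flattened output, which is
why the transformation is described as destructive.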
FILES
If no infile is specified, etset obtains its
input from standard input; if no outfile is given,
output is sent to standard output. If the --html or
-h option is set, both infile and
outfile must be specified, and outfile is the
"base name" used to create the tree of HTML documents in the
current directory. If infile is -, standard
input is read; if outfile is -, standard output
is written. outfile may not be - when
generating HTML documents.
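The file-argument rules above can be summarised in a short C++ sketch.
The IoPlan structure and the planIo function are hypothetical names
introduced for this illustration, an empty string stands for an omitted
argument, and the error messages are invented; none of this is taken
from the etset source.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of where input comes from and output goes.
struct IoPlan {
    bool valid;         // false if the argument combination is rejected
    bool readStdin;     // true if input is taken from standard input
    bool writeStdout;   // true if output is sent to standard output
    std::string error;  // illustrative message when valid is false
};

// An empty string means the argument was not given on the command line.
IoPlan planIo(const std::string &infile, const std::string &outfile,
              bool html) {
    IoPlan plan;
    plan.valid = true;
    plan.readStdin = infile.empty() || infile == "-";
    plan.writeStdout = outfile.empty() || outfile == "-";

    if (html) {
        if (infile.empty() || outfile.empty()) {
            plan.valid = false;
            plan.error = "HTML output requires both infile and outfile";
        } else if (outfile == "-") {
            plan.valid = false;
            plan.error = "HTML output cannot be written to standard output";
        }
    }

    // An accepted HTML request always names a base name for the tree.
    assert(!(html && plan.valid && plan.writeStdout));
    return plan;
}
```

Under these assumptions, planIo("-", "book", true) is accepted and reads
standard input, while planIo("book.txt", "-", true) is rejected.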
DOCUMENT PREPARATION
Document translation into LaTeX or HTML assumes the document
has been prepared according to the following guidelines.
- Characters follow the 8-bit ISO 8859/1 Latin 1 character set. ASCII
is a proper subset of this character set, so any "Plain ASCII" file
meets this criterion by definition. The extension to ISO 8859/1 is
required so that Etexts which include the accented characters used by
Western European languages may continue to be "readable by both humans
and computers".
- No white space characters other than blanks and line separators
are used (in particular, tabs are expanded to spaces).
- The text bracket sequence:
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
appears both before and after the actual body of the
Etext. This allows including an arbitrary prologue and epilogue to the
body of the document.
- Normal body text begins in column 1 and is set ragged right with a
line length of 70 characters. The choice of 70
characters is arbitrary and was made to avoid overly-long
and therefore less readable lines in the Plain Vanilla
text.
- Paragraphs are separated by blank lines.
- Centring, right justification, and left justification are indicated
by actually justifying the text that way within the 70 character line.
Left justified lines should start in column 2 to avoid
confusion with paragraph body text.
- Block quotations are indented to start in column 5 and set ragged
right with a line length of 60 characters.
- Preformatted tables begin with a line which starts in
column 3 and contains at least one sequence of three or
more spaces between nonblanks. The table is formatted
verbatim until the next blank line.
- Text set in italics is bracketed by underscore characters, "_".
These must match.
- Footnotes are included in-line, bracketed by "[]". The footnote
appears at the point in the copy where the footnote mark
appears in the source text. Footnotes may not be nested and may
consist of only a single paragraph.
- The title is defined as the sequence of lines which appear between
the first text bracket "<><><>..." and a centred line
consisting exclusively of more than two equal signs
"====".
- The author's name is the text which follows the line of equal
signs marking the end of the title and precedes the first
chapter mark. This may be multiple lines.
- Chapters are delimited by a three line sequence of
centred lines:
                            Chapter number
                         --------------------
                             Chapter name
The line of hyphens must be centred and contain
three or more hyphens and no other characters other
than white space. Chapter "numbers" need not be
numeric--they can be any text.
- Dashes in the text are indicated in the normal typewritten text
convention of "--". No hyphenation of words at the end
of lines is done.
- Ellipses are indicated by "..."; sentence-ending ellipses by
"....".
- Greek letters and mathematical symbols are enclosed in the
brackets "\(" and "\)" and are expressed as their
character or symbol names in the LaTeX typesetting
language. For example, write the Greek word for "word"
as:
\( \lambda \acute{o} \gamma o \varsigma \)
and the formula for the roots of a quadratic equation as:
\( x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \)
I acknowledge that this provision is controversial. It is as
distasteful to me as I suspect it is to you. In its defence, let me
treat the Greek letter and math formula cases separately. Using LaTeX
encoding for Greek letters is purely a stopgap until Unicode comes
into common use on enough computers so that we can use it for Etexts
which contain characters not in the ASCII or ISO 8859/1 sets (which
are the 7- and 8-bit subsets of Unicode, respectively). If an author
uses a Greek word in the text, we have two ways to proceed in
attempting to meet the condition:
The etext, when displayed, is clearly readable, and
does not contain characters other than those
intended by the author of the work, although....
The first approach is to transliterate into Roman
characters according to a standard table such as that
given in The Chicago Manual of Style. This preserves
readability and doesn't require funny encoding, but in a
sense violates the author's "original intent"--the author
could have transliterated the word in the first place but
chose not to. By transliterating we're reversing the
author's decision. The second approach, encoding in
LaTeX or some other markup language, preserves the
distinction that the author wrote the word in Greek and
maintains readability since letters are called out by
their English language names, for the most part. Of
course LaTeX helps us only for Greek (and a few
characters from other languages). If you're faced with
Cyrillic, Arabic, Chinese, Japanese, or other languages
written in non-Roman letters, the only option
(absent Unicode) is to transliterate.
I suggest that encoding mathematical formulas as LaTeX achieves the goal
of "readable by humans" on the strength of LaTeX encoding being widely
used in the physics and mathematics communities when writing formulas
in E-mail and other ASCII media. Just as one is free to
transliterate Greek in an Etext, one can use ASCII artwork formulas
like:
                    ---------
               +   /  2
            -b - \/  b  - 4ac
    x    = ------------------
     1,2           2a
This is probably a better choice for occasional formulas simple enough
to write out this way. But to produce Etexts of historic scientific
publications such as Einstein's "Zur Elektrodynamik bewegter Körper"
(the special relativity paper published in Annalen der Physik in 1905),
trying to render dozens of complicated equations in ASCII is not
only extremely tedious but in all likelihood counterproductive;
ambiguities in trying to express complex equations would make it
difficult for a reader to determine precisely what Einstein wrote
unless conventions just as complicated (and harder to learn) as those
of LaTeX were adopted for ASCII expression of mathematics. Finally,
the choice of LaTeX encoding is made not only based on its existing
widespread use but because the underlying software that defines it
(TeX and LaTeX) is entirely in the public domain, available in source
code form, implemented on most commonly-available computers, and
frozen by its authors so that, unlike many commercial products, the
syntax is unlikely to change in the future and render current
texts obsolete.
- Other punctuation in the text consists only of the characters:
. , : ; ? ! ` ' ( ) { } " + = - / * @ # $ % & ~ ^ | < >
In other words, the characters:
_ [ ] \
are never used except in the special senses defined
above.
- Quote marks may be rendered explicitly as open and close quote marks
with the sequences `single quotes' or "double quotes".
As long as quotes are balanced within a paragraph, the
ASCII quote character `"' may be used. Alternate
occurrences of this character will be typeset as opening
and closing quote characters. The open/close quote state
is reset at the start of each paragraph, limiting
the scope of errors to a single paragraph and
permitting "continuation quotes" when multiple
paragraphs are quoted.
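The alternating treatment of the ASCII quote character described in the
last guideline can be sketched in C++ as follows. This is only an
illustration under stated assumptions: the QuoteResult structure and the
convertQuotes function are hypothetical names, the LaTeX-style `` and ''
sequences are chosen as the output form for the example, and a paragraph
is taken to end at any line containing only blanks. It is not the code
used by etset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result: converted lines plus the 1-based line number at
// which each paragraph with an unclosed quote begins.
struct QuoteResult {
    std::vector<std::string> lines;
    std::vector<int> unbalancedParagraphs;
};

QuoteResult convertQuotes(const std::vector<std::string> &input) {
    QuoteResult result;
    bool open = false;   // true while inside an opened double quote
    int paraStart = 0;   // first line of the current paragraph, 0 if none

    for (std::size_t i = 0; i < input.size(); ++i) {
        const std::string &line = input[i];

        // A blank line ends the paragraph and resets the quote state.
        if (line.find_first_not_of(' ') == std::string::npos) {
            if (open) {
                result.unbalancedParagraphs.push_back(paraStart);
            }
            open = false;
            paraStart = 0;
            result.lines.push_back(line);
            continue;
        }

        if (paraStart == 0) {
            paraStart = static_cast<int>(i) + 1;
        }

        // Alternate occurrences become opening and closing quotes.
        std::string out;
        for (char c : line) {
            if (c == '"') {
                out += open ? "''" : "``";
                open = !open;
            } else {
                out += c;
            }
        }
        result.lines.push_back(out);
    }

    if (open) {
        result.unbalancedParagraphs.push_back(paraStart);
    }
    assert(result.lines.size() == input.size());
    return result;
}
```

In this sketch a continuation quote that opens a paragraph without
closing it is still typeset correctly, because the state is reset at the
next blank line, but the paragraph is recorded as unbalanced, which
mirrors the behaviour noted for the --match-quotes option.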
BUGS
HTML document trees generated by the --html option typically
require some manual editing to look their best.
Errors in Greek words and mathematical formulas encoded as
LaTeX are not detected by etset and will result in LaTeX
errors when the --latex option output is processed.
When generating HTML files, ISO graphic characters which are
not required to be encoded in the &char; form by the HTML spec
are output in their original 8-bit form. Expanding them to
their &char; equivalents would result in the output being a
pure 7-bit file, but would blow up the output file size
substantially and render it far more difficult to edit by hand.
I am aware of no contemporary Web server, browser, or authoring
tool which cannot correctly process files which include ISO
graphic characters.
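The trade-off described above can be made concrete with a small C++
sketch. The htmlEncode function and its sevenBitOnly flag are
hypothetical and exist only for this illustration; the sketch also
substitutes numeric character references such as &#246; for named
&char; entities, relying on the fact that ISO 8859-1 codes coincide with
the corresponding Unicode code points. It does not reproduce the etset
implementation.

```cpp
#include <cassert>
#include <string>

// Escape a line of ISO 8859-1 text for HTML. Markup characters are
// always escaped. ISO graphic characters (0xA0 and above) are expanded
// to numeric character references only when sevenBitOnly is true;
// codes 0x80 to 0x9F are not graphic characters and pass through.
std::string htmlEncode(const std::string &text, bool sevenBitOnly) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
        case '&':
            out += "&amp;";
            break;
        case '<':
            out += "&lt;";
            break;
        case '>':
            out += "&gt;";
            break;
        default:
            if (sevenBitOnly && c >= 0xA0) {
                out += "&#" + std::to_string(static_cast<unsigned>(c)) + ";";
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    // Encoding never shortens the text.
    assert(out.size() >= text.size());
    return out;
}
```

With sevenBitOnly set, each accented letter in this sketch grows from
one byte to six, which gives a sense of the increase in file size and
the loss of hand-editability that the paragraph above refers to.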
Crazy combinations of options may do crazy things.
SEE ALSO
ascii(7),
iso_8859_1(7),
latex(1)
- etset C++ source code: etset-3.5.tar.gz
- Read etset source code (requires PDF reader)
All prior releases remain available.
AUTHOR
John Walker
http://www.fourmilab.ch/
This software is in the public domain. Permission to use, copy,
modify, and distribute this software and its documentation for any
purpose and without fee is hereby granted, without any conditions
or restrictions. This software is provided “as is”
without express or implied warranty.