ETSET: Etext to ASCII, LaTeX, or HTML

This page describes, in Unix manual page style, a C++ program available for downloading from this site which allows you to translate human-readable electronic texts into a variety of forms for electronic publication.

NAME

etset - translate ISO 8859 Etext to LaTeX, HTML, PML, or ASCII

SYNOPSIS

etset [ options ] [ infile [ outfile ] ]

DESCRIPTION

etset allows human-readable electronic books, prepared according to the document preparation guidelines given below, to be automatically typeset using LaTeX, translated into a collection of World-Wide Web HTML documents, output in Palm Markup Language to create documents for handheld platforms, or (if written using ISO 8859 Latin 1 characters) “flattened” into a 7-bit ASCII document by removing accents from letters and translating other ISO characters into reasonable ASCII representations. In addition, various tools are available to assist editors in preparing electronic texts compatible with the format used by this program.

OPTIONS

--ascii-only: Check for the presence of any characters not part of the 7-bit ASCII set (for example, accented letters belonging to the ISO 8859-1 set), and generate warning messages identifying them.
--babel lang: Use the LaTeX babel package for language lang.
--check: Check text for publication. Report any invalid characters or formatting errors to standard error.
--clean: Clean up text for publication: expand tab characters to spaces, remove trailing blanks from lines.
--copyright: Print copying information.
--debug-parser file: Write parser debugging information to file. Each line in the body of the text is labeled with the identification assigned it by the parser.
--flatten-iso: ISO 8859-1 8-bit characters are replaced with their closest 7-bit ASCII equivalent (for example, accented letters are changed to unaccented characters). This is a destructive transformation, and should be performed only when a text must be displayed on a device which cannot accept 8-bit characters.
--french-punctuation: Insert nonbreaking spaces around punctuation as normally done when typesetting French. Guillemets, colons, semicolons, question marks, and exclamation points are set off from the adjoining text by a space. This mode is unnecessary when typesetting French with the --babel francais option.
--help, -u: Print how-to-call information including a list of options.
--html, -h: Generate HTML output. By default, a document tree is generated with an index document which links to individual chapter documents, each of which contains navigation links. If the --single-file option is specified, a single HTML document containing the entire text is generated. HTML files are written to the current directory.
--latex, -l: Generate a LaTeX file to typeset the document. If the document is in a language other than English, you may also wish to use the --babel option to invoke formatting appropriate for the language.
--match-quotes: Report possible mismatched double quotes. (Note that multi-paragraph continued quotes with a quote at the start and no closing quote will be reported as mismatched by this option.) Quote matching is performed only when generating an output format which translates double quotes into opening and closing quotes (--latex, --palm, or --html with --unicode specified).
--modest: Suppress the “Translated by” message in output files. This is primarily useful when making and checking regression test output to avoid discrepancies due only to the version number, date, and time in these messages.
--palm, -p: Generate a file in Palm Markup Language to create a document for Palm Reader on handheld platforms.
--save-epilogue file: The document epilogue is written to the designated file.
--save-prologue file: The document prologue is written to the designated file.
--single-file: Generate a single HTML file containing all chapters, as opposed to the default of a document tree with a separate file for each chapter.
--special-strip: Remove all format-specific special commands from the document, and any blank lines following a special command if they would result in consecutive blank lines in the document. This option may be used in conjunction with the --clean option when preparing a text for publication in "Plain ASCII" format.
--strict: Generate HTML compatible with the XHTML 1.0 Strict Document Type Definition. Note that this will make extensive use of Cascading Style Sheets (CSS) in the document, which may cause compatibility problems with older browsers which do not support or incorrectly implement style specifications.
--unicode: Generate XHTML Unicode text entities for characters (for example, opening and closing quotation marks and dashes) not present in the ISO-8859 Latin-1 character set.
--verbose, -v: Print information regarding processing of the document, including the number of lines read and written.
--version: Print program version information.
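
As an illustration of what the --clean option is described as doing (expanding tabs and removing trailing blanks), here is a minimal sketch. It is not taken from the etset source: the function name cleanLine and the tab stop width of 8 columns are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of etset: expand tab characters to spaces
// and strip trailing blanks from one line of text.
// Assumption: tab stops fall every tabWidth columns (8 by default).
std::string cleanLine(const std::string &line, std::size_t tabWidth = 8) {
    std::string out;
    for (char c : line) {
        if (c == '\t') {
            // Pad with spaces up to the next tab stop.
            std::size_t pad = tabWidth - (out.size() % tabWidth);
            out.append(pad, ' ');
        } else {
            out.push_back(c);
        }
    }
    // Remove trailing blanks; a line of only blanks becomes empty.
    std::size_t end = out.find_last_not_of(' ');
    out.erase(end == std::string::npos ? 0 : end + 1);
    return out;
}
```

Applied to every line of a text, a function like this yields a file that satisfies the white space guideline given under DOCUMENT PREPARATION.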

FILES

If no infile is specified, etset obtains its input from standard input; if no outfile is given, output is sent to standard output. If the --html or -h option is set, both infile and outfile must be specified, and outfile is the "base name" used to create the tree of HTML documents in the current directory. If infile is -, standard input is read; if outfile is -, standard output is written. outfile may not be - when generating HTML documents.
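
The rules above can be summarised in a short sketch. The names StreamPlan and planStreams are hypothetical, an empty string stands for an omitted argument, and reporting a bad combination by throwing an exception is an assumption; how etset itself reports such errors is not stated here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical summary of where input is read and output is written.
struct StreamPlan {
    bool readStdin;    // true: read standard input
    bool writeStdout;  // true: write standard output
};

// Hypothetical helper, not part of etset. An empty string means the
// argument was omitted; "-" explicitly selects the standard stream.
StreamPlan planStreams(const std::string &infile,
                       const std::string &outfile,
                       bool html) {
    bool readStdin = infile.empty() || infile == "-";
    bool writeStdout = outfile.empty() || outfile == "-";
    // HTML output needs both arguments, and outfile must be a base name.
    if (html && (infile.empty() || writeStdout)) {
        throw std::invalid_argument(
            "HTML output requires infile and an outfile base name");
    }
    return {readStdin, writeStdout};
}
```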

DOCUMENT PREPARATION

Document translation into LaTeX or HTML assumes the document has been prepared according to the following guidelines.

Characters follow the 8-bit ISO 8859/1 Latin 1 character set. ASCII is a proper subset of this character set, so any "Plain ASCII" file meets this criterion by definition. The extension to ISO 8859/1 is required so that Etexts which include the accented characters used by Western European languages may continue to be "readable by both humans and computers".
No white space characters other than blanks and line separators are used (in particular, tabs are expanded to spaces).
The text bracket sequence:
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
appears both before and after the actual body of the Etext. This allows including an arbitrary prologue and epilogue to the body of the document.
Normal body text begins in column 1 and is set ragged right with a line length of 70 characters. The choice of 70 characters is arbitrary and was made to avoid overly-long and therefore less readable lines in the Plain Vanilla text.
Paragraphs are separated by blank lines.
Centred, right-justified, and left-justified text is indicated by actually so justifying the text within the 70 character line. Left justified lines should start in column 2 to avoid confusion with paragraph body text.
Block quotations are indented to start in column 5 and set ragged right with a line length of 60 characters.
Preformatted tables begin with a line which starts in column 3 and contains at least one sequence of three or more spaces between nonblanks. The table is formatted verbatim until the next blank line.
Text set in italics is bracketed by underscore characters, "_". These must match.
Footnotes are included in-line, bracketed by "[]". The footnote appears at the point in the copy where the footnote mark appears in the source text. Footnotes may not be nested and may consist of only a single paragraph.
The title is defined as the sequence of lines which appear between the first text bracket "<><><>..." and a centred line consisting exclusively of more than two equal signs "====".
The author's name is the text which follows the line of equal signs marking the end of the title and precedes the first chapter mark. This may be multiple lines.
Chapters are delimited by a three line sequence of centred lines:
```
                            Chapter number
                         ====================
                             Chapter name
```
The line of equal signs must be centred and contain three or more equal signs and no other characters other than white space. Chapter "numbers" need not be numeric--they can be any text.
Dashes in the text are indicated in the normal typewritten text convention of "--". No hyphenation of words at the end of lines is done.
Ellipses are indicated by "..."; sentence-ending ellipses by "....".
Greek letters and mathematical symbols are enclosed in the brackets "$" and "$" and are expressed as their character or symbol names in the LaTeX typesetting language. For example, write the Greek word for "word" as:
$ \lambda \acute{o} \gamma o \varsigma $

and the formula for the roots of a quadratic equation as:

$ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $

I acknowledge that this provision is controversial. It is as distasteful to me as I suspect it is to you. In its defence, let me treat the Greek letter and math formula cases separately. Using LaTeX encoding for Greek letters is purely a stopgap until Unicode comes into common use on enough computers so that we can use it for Etexts which contain characters not in the ASCII or ISO 8859/1 sets (which are the 7- and 8-bit subsets of Unicode, respectively). If an author uses a Greek word in the text, we have two ways to proceed in attempting to meet the condition:

The etext, when displayed, is clearly readable, and does not contain characters other than those intended by the author of the work, although....

The first approach is to transliterate into Roman characters according to a standard table such as that given in The Chicago Manual of Style. This preserves readability and doesn't require funny encoding, but in a sense violates the author's "original intent"--the author could have transliterated the word in the first place but chose not to. By transliterating we're reversing the author's decision. The second approach, encoding in LaTeX or some other markup language, preserves the distinction that the author wrote the word in Greek and maintains readability since letters are called out by their English language names, for the most part. Of course LaTeX helps us only for Greek (and a few characters from other languages). If you're faced with Cyrillic, Arabic, Chinese, Japanese, or other languages written in non-Roman letters, the only option (absent Unicode) is to transliterate.

I suggest that encoding mathematical formulas as LaTeX achieves the goal of "readable by humans" on the strength of LaTeX encoding being widely used in the physics and mathematics communities when writing formulas in E-mail and other ASCII media. Just as one is free to transliterate Greek in an Etext, one can use ASCII artwork formulas like:
```
                              ---------
                         +   /  2
                      -b - \/  b  - 4ac
            x     =  ------------------
             1,2            2a
```
This is probably a better choice for occasional formulas simple enough to write out this way. But to produce Etexts of historic scientific publications such as Einstein's "Zur Elektrodynamik bewegter Körper" (the special relativity paper published in Annalen der Physik in 1905), trying to render dozens of complicated equations in ASCII is not only extremely tedious but in all likelihood counterproductive; ambiguities in trying to express complex equations would make it difficult for a reader to determine precisely what Einstein wrote unless conventions just as complicated as (and harder to learn than) those of LaTeX were adopted for ASCII expression of mathematics. Finally, the choice of LaTeX encoding is made not only based on its existing widespread use but because the underlying software that defines it (TeX and LaTeX) is entirely in the public domain, available in source code form, implemented on most commonly-available computers, and frozen by its authors so that, unlike many commercial products, the syntax is unlikely to change in the future and render current texts obsolete.
Other punctuation in the text consists only of the characters:
. , : ; ? ! ` ' ( ) { } " + = - / * @ # $ % & ~ ^ | < >

In other words, the characters:

_ [ ] \
are never used except in the special senses defined above.
Quote marks may be rendered explicitly as open and close quote marks with the sequences `single quotes' or "double quotes". As long as quotes are balanced within a paragraph, the ASCII quote character `"' may be used. Alternate occurrences of this character will be typeset as opening and closing quote characters. The open/close quote state is reset at the start of each paragraph, limiting the scope of errors to a single paragraph and permitting "continuation quotes" when multiple paragraphs are quoted.
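
The alternating quote rule in the last guideline can be sketched as follows. This is an illustration only, not the code etset uses: the function name typesetQuotes is hypothetical, and the LaTeX-style `` and '' output is assumed for the example. Calling the function once per paragraph gives the per-paragraph reset described above.

```cpp
#include <string>

// Hypothetical helper, not part of etset: replace each ASCII double quote
// in one paragraph with an opening or closing quote, alternating.
// Assumption: LaTeX-style `` and '' are the desired output forms.
std::string typesetQuotes(const std::string &paragraph) {
    std::string out;
    bool open = false;  // state starts "closed" for every paragraph
    for (char c : paragraph) {
        if (c == '"') {
            out += open ? "''" : "``";
            open = !open;
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

Because the state is local to each call, a paragraph that opens a quotation without closing it (a continuation quote) does not affect the paragraph that follows.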

BUGS

HTML document trees generated by the --html option typically require some manual editing to look their best.

Errors in Greek words and mathematical formulas encoded as LaTeX are not detected by etset and will result in LaTeX errors when the --latex option output is processed.

When generating HTML files, ISO graphic characters which are not required to be encoded in the &char; form by the HTML spec are output in their original 8-bit form. Expanding them to their &char; equivalents would result in the output being a pure 7-bit file, but would blow up the output file size substantially and render it far more difficult to edit by hand. I am aware of no contemporary Web server, browser, or authoring tool which cannot correctly process files which include ISO graphic characters.
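
To make the size trade-off concrete, here is a minimal sketch of the expansion that etset deliberately does not perform. It is an assumption-laden illustration: the function name expandToEntities is hypothetical, and it emits numeric character references (which work because ISO 8859-1 codes coincide with Unicode code points) rather than named entities.

```cpp
#include <string>

// Hypothetical helper, not part of etset: rewrite ISO 8859-1 graphic
// characters (0xA0 to 0xFF) as numeric character references so that the
// result is pure 7-bit ASCII. Each such byte grows to six bytes.
std::string expandToEntities(const std::string &latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (c >= 0xA0) {
            out += "&#" + std::to_string(static_cast<unsigned>(c)) + ";";
        } else {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

A single accented letter such as é (byte 0xE9) becomes the six characters `&#233;`, which shows both the growth in file size and the loss of readability when editing by hand.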

Crazy combinations of options may do crazy things.

Download

etset C++ source code: etset-3.5.tar.gz
Read etset source code (requires PDF reader)

All prior releases remain available.

AUTHOR

John Walker
http://www.fourmilab.ch/

This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty.
