ETSET: Etext to ASCII, LaTeX, or HTML

This page describes, in Unix manual page style, a C program available for downloading from this site in either Gzipped TAR or Zip form, which allows you to translate human-readable electronic texts into a variety of forms for electronic publication.

NAME

etset - translate ISO 8859 Etext to ASCII, LaTeX, or HTML

SYNOPSIS

etset [ -a -f -h -l -u -w ] [ infile [ outfile ] ]

DESCRIPTION

etset allows human-readable electronic books, prepared according to the document preparation guidelines given below, to be automatically typeset using LaTeX, translated into a collection of World-Wide Web HTML documents, or (if written using ISO 8859 Latin 1 characters) "flattened" into a 7-bit ASCII document by removing accents from letters and translating other ISO characters into reasonable ASCII representations.
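
The flattening step can be pictured with a minimal sketch in C. This is not taken from the etset source: the names flatten_char and flatten_latin1, the particular replacement strings, and the use of "?" for characters missing from the table are all assumptions made for illustration, and the table covers only a handful of ISO 8859-1 codes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: map one ISO 8859-1 byte (0x80 and up) to an
   ASCII replacement.  Only a few codes are covered in this sketch. */
static const char *flatten_char(unsigned char c)
{
    switch (c) {
    case 0xA0: return " ";      /* non-breaking space */
    case 0xA9: return "(C)";    /* copyright sign */
    case 0xAB: return "<<";     /* left guillemet */
    case 0xBB: return ">>";     /* right guillemet */
    case 0xC0: case 0xC1: case 0xC2:
    case 0xC3: case 0xC4: case 0xC5: return "A";
    case 0xC7: return "C";
    case 0xC8: case 0xC9: case 0xCA: case 0xCB: return "E";
    case 0xE0: case 0xE1: case 0xE2:
    case 0xE3: case 0xE4: case 0xE5: return "a";
    case 0xE7: return "c";
    case 0xE8: case 0xE9: case 0xEA: case 0xEB: return "e";
    case 0xF1: return "n";
    default:   return "?";      /* not in this sketch's table */
    }
}

/* Copy "in" to "out", replacing 8-bit characters with ASCII.
   Returns 0 on success, -1 if "out" is too small. */
int flatten_latin1(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char) *in;
        char ascii[2] = { (char) c, '\0' };
        const char *rep = (c < 0x80) ? ascii : flatten_char(c);
        size_t len = strlen(rep);

        if (n + len + 1 > outsize)
            return -1;
        memcpy(out + n, rep, len);
        n += len;
    }
    out[n] = '\0';
    return 0;
}
```

With this table, the bytes for "café" (with é as 0xE9) come out as "cafe", and a pair of guillemets becomes "<<" and ">>".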

OPTIONS

-a: Flatten ISO characters in the document to 7-bit ASCII. Accents are discarded from letters, and other ISO graphic characters are translated to ASCII replacements. When the -a option is specified, the entire input stream is processed, not just the portion marked by text brackets. This permits etset to be used as an ISO to ASCII filter for files in any format.
-f: Process the document assuming it employs the French punctuation style where spaces are set before the punctuation marks : ; ? ! » and after «. This option causes etset, when generating LaTeX or HTML output, to emit a non-breaking space between the punctuation mark and the adjacent word, preventing line breaking which might divide the word and punctuation mark. etset assumes the input text does not contain any such incorrect line breaks. The -f option has no effect if used in conjunction with the -a option.
-h: Generates a set of World-Wide Web HTML files in the directory specified by outfile. When the -h option is given, both infile and outfile must be specified, and the outfile directory must be created prior to running etset. A root document named outfile/outfile.html will contain the table of contents for the document, and separate HTML files will be generated for each chapter and the footnotes (if any). Navigation icons are automatically placed in the outfile directory.
-l: Generate a LaTeX file, suitable for typesetting with latex. This is the default if no output format options are given. The output is compatible with LaTeX 2e and later.
-u: Print how-to-call information.
-w: Embed warnings in the output document if non-ISO characters are encountered in the input. Non-ISO characters consist of non-whitespace ASCII control characters (decimal 0 through 31), the DEL character (decimal 127), and the ISO reserved control characters (128 through 159). This option is particularly useful when processing documents created with packages such as Microsoft Word, which use proprietary codes outside the ISO range for various regrettable features. If the -w option is not set, non-ISO characters are silently discarded.
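
The character ranges given for the -w option translate directly into a small predicate. The sketch below is illustrative only; the name is_non_iso is hypothetical, and the set of control characters treated as white space (tab, line feed, vertical tab, form feed, carriage return) is an assumption, since the text above does not enumerate them.

```c
/* Hypothetical predicate: returns 1 if byte c is a "non-ISO"
   character in the sense described for the -w option. */
int is_non_iso(unsigned char c)
{
    if (c < 32) {
        /* Assumed white space controls; all other controls are non-ISO. */
        return !(c == '\t' || c == '\n' || c == '\v' ||
                 c == '\f' || c == '\r');
    }
    /* DEL, and the ISO reserved control range 128 through 159. */
    return c == 127 || (c >= 128 && c <= 159);
}
```

For example, byte 0x93, which some Microsoft code pages use for a curly opening quote, falls in the reserved range and is flagged, while 0xE9 (e with acute accent) is a legitimate ISO graphic character and is not.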

FILES

If no infile is specified, etset obtains its input from standard input; if no outfile is given, output is sent to standard output. If the -h option is set, both infile and outfile must be specified, and outfile must give the name of a previously created subdirectory in the directory in which etset is being run.

DOCUMENT PREPARATION

Document translation into LaTeX or HTML assumes the document has been prepared according to the following guidelines.

Characters follow the 8-bit ISO 8859/1 Latin 1 character set. ASCII is a proper subset of this character set, so any "Plain ASCII" file meets this criterion by definition. The extension to ISO 8859/1 is required so that Etexts which include the accented characters used by Western European languages may continue to be "readable by both humans and computers".
No white space characters other than blanks and line separators are used (in particular, tabs are expanded to spaces).
The text bracket sequence:
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
appears both before and after the actual body of the Etext. This allows including an arbitrary prefix and postfix to the body of the document.
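
As a sketch of how the body might be located between the two brackets, the C fragment below scans an array of lines. It is an illustration, not etset's code: the names is_text_bracket and find_body are hypothetical, and accepting any line made up of three or more "<>" pairs (rather than requiring the exact length shown above) is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the line consists solely of repeated "<>" pairs
   (at least three, an assumed minimum), ignoring trailing blanks
   and the line terminator. */
int is_text_bracket(const char *line)
{
    size_t len = strcspn(line, "\r\n");
    size_t i;

    while (len > 0 && line[len - 1] == ' ')
        len--;
    if (len < 6 || len % 2 != 0)
        return 0;
    for (i = 0; i < len; i += 2) {
        if (line[i] != '<' || line[i + 1] != '>')
            return 0;
    }
    return 1;
}

/* Finds the body between the first two bracket lines.  On success
   returns 1 and sets *start to the first body line and *end to the
   index of the closing bracket line; returns 0 if either bracket
   is missing. */
int find_body(const char *const lines[], size_t n,
              size_t *start, size_t *end)
{
    size_t i;
    int found = 0;

    for (i = 0; i < n; i++) {
        if (!is_text_bracket(lines[i]))
            continue;
        if (!found) {
            *start = i + 1;
            found = 1;
        } else {
            *end = i;
            return 1;
        }
    }
    return 0;
}
```

Everything before *start and from *end onward is the arbitrary prefix and postfix, which a translator would skip unless, as with the -a option, the whole input stream is processed.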
Normal body text begins in column 1 and is set ragged right with a line length of 70 characters. The choice of 70 characters is arbitrary and was made to avoid overly-long and therefore less readable lines in the Plain Vanilla text.
Paragraphs are separated by blank lines.
Centering, right justification, and left justification are indicated by actually so-justifying the text within the 70 character line. Left justified lines should start in column 2 to avoid confusion with paragraph body text.
Block quotations are indented to start in column 5 and set ragged right with a line length of 60 characters.
Text set in italics is bracketed by underscore characters, "_". These must match.
Footnotes are included in-line, bracketed by "[]". The footnote appears at the point in the copy where the footnote mark appears in the source text.
The title is defined as the sequence of lines which appear between the first text bracket "<><><>..." and a centered line consisting exclusively of more than two equal signs "====".
The author's name is the text which follows the line of equal signs marking the end of the title and precedes the first chapter mark. This may be multiple lines.
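
The line of equal signs that ends the title could be recognized with a test along the following lines. This is a hedged sketch, not the check etset actually performs: the name is_equals_rule is hypothetical, and the two-column tolerance used to decide whether a line counts as centered within the 70 character line is an assumption.

```c
#include <stdlib.h>
#include <string.h>

#define LINE_WIDTH  70   /* body line length given in the guidelines */
#define CENTER_SLOP 2    /* assumed tolerance, in columns */

/* Returns 1 if the line is a centered run of three or more equal
   signs with nothing else on it but white space. */
int is_equals_rule(const char *line)
{
    size_t lead = strspn(line, " ");         /* left margin */
    size_t len = strspn(line + lead, "=");   /* run of equal signs */
    const char *rest = line + lead + len;
    long right;

    if (len < 3)
        return 0;
    if (rest[strspn(rest, " \r\n")] != '\0')
        return 0;                            /* other characters follow */
    right = (long) LINE_WIDTH - (long) lead - (long) len;
    return labs((long) lead - right) <= CENTER_SLOP;
}
```

A run of four equal signs preceded by 33 blanks leaves 33 columns on either side and is accepted; the same run starting in column 1 is rejected as not centered.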
Chapters are delimited by a three line sequence of centered lines:
```
Chapter number
====================
Chapter name
```
The line of equal signs must be centered and contain three or more equal signs and no characters other than white space. Chapter "numbers" need not be numeric--they can be any text. Documents without chapter breaks should contain an initial chapter mark following the title, with a Chapter number of "*" and a blank Chapter name.
Dashes in the text are indicated in the normal typewritten text convention of "--". No hyphenation of words at the end of lines is done.
Ellipses are indicated by "..."; sentence-ending ellipses by "....".
Greek letters and mathematical symbols are enclosed in the brackets "\(" and "\)" and are expressed as their character or symbol names in the LaTeX typesetting language. For example, write the Greek word for "word" as:
\( \lambda \acute{o} \gamma o \varsigma \)
and the formula for the roots of a quadratic equation as:
\( x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \)
(Note: I acknowledge that this provision is controversial. It is as distasteful to me as I suspect it is to you. In its defence, let me treat the Greek letter and math formula cases separately. Using LaTeX encoding for Greek letters is purely a stopgap until Unicode comes into common use on enough computers so that we can use it for Etexts which contain characters not in the ASCII or ISO 8859/1 sets (which are the 7- and 8-bit subsets of Unicode, respectively). If an author uses a Greek word in the text, we have two ways to proceed in attempting to meet the condition:

The etext, when displayed, is clearly readable, and does not contain characters other than those intended by the author of the work, although....

The first approach is to transliterate into Roman characters according to a standard table such as that given in The Chicago Manual of Style. This preserves readability and doesn't require funny encoding, but in a sense violates the author's "original intent"--the author could have transliterated the word in the first place but chose not to. By transliterating we're reversing the author's decision. The second approach, encoding in LaTeX or some other markup language, preserves the distinction that the author wrote the word in Greek and maintains readability since letters are called out by their English language names, for the most part. Of course LaTeX helps us only for Greek (and a few characters from other languages). If you're faced with Cyrillic, Arabic, Chinese, Japanese, or other languages written in non-Roman letters, the only option (pre-Unicode) is to transliterate.
I argue that encoding mathematical formulas as LaTeX achieves the goal of "readable by humans" on the strength of LaTeX encoding being widely used in the physics and mathematics communities when writing formulas in E-mail and other ASCII media. Just as one is free to transliterate Greek in an Etext, one can use ASCII artwork formulas like:
```
                              ---------
                         +   /  2
                      -b - \/  b  - 4ac
            x     =  ------------------
             1,2            2a
```
This is probably a better choice for occasional formulas simple enough to write out this way. But to produce Etexts of historic scientific publications such as Einstein's "Zur Elektrodynamik bewegter Körper" (the special relativity paper published in Annalen der Physik in 1905), trying to render the hundreds of complicated equations in ASCII is not only extremely tedious but in all likelihood counterproductive; ambiguities in rendering complex equations would make it difficult for a reader to determine precisely what Einstein wrote unless conventions just as complicated as those of LaTeX (and harder to learn) were adopted for ASCII expression of mathematics. Finally, the choice of LaTeX encoding is made not only based on its existing widespread use but because the underlying software that defines it (TeX and LaTeX) is entirely in the public domain, available in source code form, implemented on most commonly available computers, and frozen by its authors so that, unlike many commercial products, the syntax is unlikely to change in the future and render current texts obsolete).
Other punctuation in the text consists only of the characters:
. , : ; ? ! ` ' ( ) { } " + = - / * @ # $ % & ~ ^ | < >
In other words, the characters:
_ [ ] \
are never used except in the special senses defined above.
Quote marks may be rendered explicitly as open and close quote marks with the sequences `single quotes' or "double quotes". As long as quotes are balanced within a paragraph, the ASCII quote character `"' may be used. Alternate occurrences of this character will be typeset as opening and closing quote characters. The open/close quote state is reset at the start of each paragraph, limiting the scope of errors to a single paragraph.
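
The alternating treatment of the ASCII quote character can be sketched as follows. This is an illustration under stated assumptions rather than etset's own routine: the name quotes_to_latex is hypothetical, and emitting the LaTeX sequences `` and '' for opening and closing quotes is an assumption about the output form. The function handles one paragraph per call, so the open/close state starts fresh each time.

```c
#include <stddef.h>
#include <string.h>

/* Convert the ASCII double quotes in one paragraph to alternating
   LaTeX opening and closing quotes.  Returns 0 if the quotes were
   balanced, 1 if the paragraph ended with a quote still open, and
   -1 if "out" is too small. */
int quotes_to_latex(const char *para, char *out, size_t outsize)
{
    int open = 0;       /* state is reset for every paragraph */
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *para != '\0'; para++) {
        char plain[2] = { *para, '\0' };
        const char *rep = plain;
        size_t len;

        if (*para == '"') {
            rep = open ? "''" : "``";
            open = !open;
        }
        len = strlen(rep);
        if (n + len + 1 > outsize)
            return -1;
        memcpy(out + n, rep, len);
        n += len;
    }
    out[n] = '\0';
    return open ? 1 : 0;
}
```

Because the state lives in a local variable, an unbalanced quote in one paragraph cannot flip the sense of the quotes in the next, which is the error containment described above.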

BUGS

HTML document trees generated by the -h option typically require some manual editing to look their best.

Errors in Greek words and mathematical formulas encoded as LaTeX are not detected by etset and will result in LaTeX errors when the -l option output is processed.

When generating HTML files, ISO graphic characters which are not required to be encoded in the &char; form by the HTML spec are output in their original 8-bit form. Expanding them to their &char; equivalents would result in the output being a pure 7-bit file, but would blow up the output file size substantially and render it far more difficult to edit by hand. I am aware of no contemporary Web server, browser, or authoring tool which cannot correctly process files which include ISO graphic characters.
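
The policy described above amounts to escaping only the characters HTML reserves for markup and copying everything else through unchanged. A minimal sketch, with hypothetical names html_entity and put_html and assuming that the required set is just "&", "<", and ">" in running text:

```c
#include <stdio.h>

/* Returns the entity for a character that must be encoded in HTML
   text, or NULL if the character can be output as-is (including
   8-bit ISO graphic characters). */
const char *html_entity(unsigned char c)
{
    switch (c) {
    case '&': return "&amp;";
    case '<': return "&lt;";
    case '>': return "&gt;";
    default:  return NULL;
    }
}

/* Write text to fp, encoding only the characters that require it. */
void put_html(FILE *fp, const char *text)
{
    for (; *text != '\0'; text++) {
        const char *entity = html_entity((unsigned char) *text);

        if (entity != NULL)
            fputs(entity, fp);
        else
            putc(*text, fp);
    }
}
```

An accented letter such as byte 0xE9 passes through put_html as a single 8-bit character instead of expanding to the eight characters of its named entity, which is where the saving in file size comes from.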

The structure of the program resembles an inelastic collision of three separate programs for ASCII, LaTeX, and HTML output, for the excellent reason that etset is precisely that. While this results in substantial duplication of code, it does mean that changes in the code for a given format are less likely to break one of the other output types.

Download etset source code: etset.tar.gz or etset.zip

AUTHOR

John Walker
http://www.fourmilab.ch/

This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided "as is" without express or implied warranty.