unum - Interconvert numbers, Unicode, and HTML/XHTML characters

Once upon a time, all the characters you needed when writing documents and programs were right there on the keyboard (at least if you're too young to have used a keypunch and think 11−6−8 every time you see a semicolon), but that was then and this is the globalised, technology saturated twenty-first century, with flying cars, fusion power generators, holidays on the Moon, and instantaneous worldwide communication…oh, wait…well, the last one anyway. When writing for a wired world, one often finds the need to use characters from languages other than one's own, mathematical and scientific symbols, fancy punctuation, printer's ornaments, and other embellishments which give online documents that professional polish.

Usually, this has meant hauling out that monster (1500 page, 3.4 kg) Unicode book, searching for the character that's required, then converting its hexadecimal character code to a decimal or symbolic HTML “character entity”, which means firing up the programmer's calculator and/or another table lookup. Enter unum, a stand-alone utility program written in portable Perl which allows you to look up Unicode and HTML characters by name or number, and interconvert numbers in decimal, hexadecimal, and octal bases.

Downloading and Installation

The unum utility is available in two varieties, compressed and uncompressed, each distributed as a downloadable archive.

The compressed and uncompressed versions of the program work identically. The uncompressed version contains the entire Unicode character database in text form, and is more than 8 megabytes when extracted from the archive. The compressed version uses the same database, but stored compressed with the bzip2 utility; the database is automatically uncompressed when the program is run. The compressed version is a little less than a megabyte in size and loads much faster—even including the time to decompress the database, it usually runs faster than the uncompressed version. The uncompressed version is provided for users of systems which do not support the required bunzip2 utility or a version of Perl which permits binary data embedded in programs.

Each archive contains a single file, unum.pl, which is the self-contained Perl source code for unum. You may wish to rename it unum and install it in a directory on your PATH, but it may be run from any location. The program uses no modules from the Perl library: there are no prerequisites other than Perl itself and bunzip2 for the compressed version.

For full functionality, including the ability to display Unicode characters and accept them on the command line, unum requires a release of Perl contemporary with its early 2006 release date. The program has been tested on Perl 5.32.1; if you run it on an earlier version, you may get an error indicating that the “-CA” option in the first line of the program is not implemented. If this specification is removed, the program will still work, but you won't be able to specify Unicode characters on the command line. Unicode command line arguments and terminal output require an operating system and shell which support these features and may not work on your system. You may have to change the character encoding in your terminal program to permit Unicode input and output. The character name and number lookup facilities will work on any system with a vaguely recent version of Perl.

Manual Page

You can print the following documentation directly from the unum.pl program with the command "perldoc unum.pl". Calling the program with no arguments will print a short summary of argument formats.

Name

unum — Interconvert numbers, Unicode, and HTML/XHTML characters

Synopsis

unum argument…

Description

The unum program is a command line utility which allows you to convert decimal, octal, hexadecimal, and binary numbers; Unicode character and block names; and HTML/XHTML character entity names into one another. It can be used as an on-line special character reference for Web authors.

Arguments

The command line may contain any number of the following forms of argument.

123: Decimal number.
0371: Octal number preceded by a zero.
0x1D351: Hexadecimal number preceded by 0x. Letters may be upper or lower case, but the x must be lower case.
0b110101: Binary number.
b=block: Unicode character blocks matching block are listed. The block specification may be a regular expression. For example, b=greek lists all Greek character blocks in Unicode.
c=char…: The Unicode character codes for the characters char… are printed. If the first character is not a decimal digit and the second not an equal sign, the c= may be omitted.
h=entity: List all HTML/XHTML character references matching entity, which may be a regular expression. Matching is case-insensitive, so h=alpha finds both &Alpha; and &alpha;. If the reference is composed of multiple Unicode code points, the components are printed after the name of the composed character reference.
'&#number;&#xhexnum;…': List the characters corresponding to the specified HTML/XHTML character entities, which may be given in either decimal or hexadecimal. Note that the “x” in XHTML entities must be lower case. On most Unix-like operating systems, you'll need to quote the argument so the ampersand, octothorpe, and semicolon aren't interpreted by the shell.
l=block: List all Unicode blocks matching block and all characters within each block; l=goth lists the Gothic block and the 32 characters it contains.
n=name: List all Unicode characters whose names match name, which may be a regular expression. For example, n=telephone finds the twelve Unicode characters for telephone symbols.
utf8=number: Treating the number (which may be specified in decimal, octal, hexadecimal, or binary, as for numeric arguments) as a sequence of one to four bytes, decode the bytes as the UTF-8 representation of a character. For example, utf8=0xE298A2 decodes to Unicode code point 0x2622, the radioactive sign.
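
The numeric argument forms and the utf8= decoding described above can be sketched in Python as follows. This is an illustration only, not unum's own Perl code; the function names parse_number and decode_utf8_number are hypothetical, and the handling of malformed input is an assumption.

```python
def parse_number(text):
    """Parse decimal, 0-prefixed octal, 0x hexadecimal, or 0b binary."""
    if text.startswith("0x"):          # the x must be lower case
        return int(text[2:], 16)
    if text.startswith("0b"):
        return int(text[2:], 2)
    if len(text) > 1 and text.startswith("0"):
        return int(text[1:], 8)        # leading zero means octal
    return int(text, 10)


def decode_utf8_number(text):
    """Treat the number as one to four bytes and decode them as UTF-8."""
    value = parse_number(text)
    length = max(1, (value.bit_length() + 7) // 8)
    if length > 4:
        raise ValueError("a UTF-8 sequence is at most four bytes long")
    raw = value.to_bytes(length, "big")
    # bytes.decode raises UnicodeDecodeError for malformed sequences.
    decoded = raw.decode("utf-8")
    if len(decoded) != 1:
        raise ValueError("bytes encode more than one character")
    return ord(decoded)
```

With these definitions, decode_utf8_number("0xE298A2") returns 0x2622, matching the radioactive sign example above.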

Options

--nent: When showing an HTML character reference, always show its numerical form (for example, &#8212;), even if it has a named character reference.
--utf8: Show UTF-8 encoding of characters as a byte sequence in a hexadecimal number. This is the same format as is accepted by the utf8= argument. The option applies to the display of all arguments which follow on the command line.
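
As a rough illustration of what these two options display, the Python sketch below builds the decimal numeric character reference and the UTF-8 byte sequence, written as a single hexadecimal number, for a code point. The function names are hypothetical, and the exact text unum prints may differ.

```python
def numeric_reference(code_point):
    """Return the decimal numeric character reference for a code point."""
    return "&#{};".format(code_point)


def utf8_hex(code_point):
    """Return the UTF-8 encoding of a code point as one hexadecimal number."""
    encoded = chr(code_point).encode("utf-8")
    return "0x" + encoded.hex().upper()
```

For example, numeric_reference(0x2014) gives &#8212; and utf8_hex(0x2622) gives 0xE298A2, the same value the utf8= argument accepts.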

Output

For number or character arguments, the value(s) are listed in all of the input formats, save binary.

   Octal  Decimal      Hex        HTML    Character   Unicode
     056       46     0x2E    &period;    "."         FULL STOP

If the terminal font cannot display the character being listed, the "Character" field will contain whatever default is shown in such circumstances. Control characters are shown as a Perl hexadecimal escape. If multiple HTML named character references map to the same Unicode code point, all are shown separated by commas.
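
The fields of such a row can be approximated with Python's standard library, as in the sketch below. It is not how unum builds its output: it relies on Python's own unicodedata and html.entities tables, which may correspond to a different Unicode version than unum's database. The function names and the exact escape style used for control characters are assumptions.

```python
import unicodedata
from html.entities import html5


def entity_names(char):
    """Return every HTML named character reference that maps to char."""
    # Keys such as "amp" and "amp;" differ only by the trailing
    # semicolon, so strip it and collect the distinct names.
    return sorted({name.rstrip(";") for name, value in html5.items()
                   if value == char})


def describe(code_point):
    """Return (octal, decimal, hex, html, character, name) for a code point."""
    char = chr(code_point)
    names = entity_names(char)
    if names:
        # Several names for one code point are joined with commas.
        html = ",".join("&{};".format(name) for name in names)
    else:
        html = "&#{};".format(code_point)
    if unicodedata.category(char) == "Cc":
        # Assumed escape style for control characters; unum's exact
        # format may differ.
        shown = "\\x{{{:X}}}".format(code_point)
    else:
        shown = '"{}"'.format(char)
    return ("0{:o}".format(code_point), code_point,
            "0x{:X}".format(code_point), html, shown,
            unicodedata.name(char, ""))
```

Calling describe(0x2E) yields the octal 056, the hexadecimal 0x2E, and the name FULL STOP, in line with the sample row above.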

Unicode blocks are listed as follows:

    Start        End  Unicode Block
   U+2460 -   U+24FF  Enclosed Alphanumerics
  U+1D400 -  U+1D7FF  Mathematical Alphanumeric Symbols

Version

This is unum version 3.6-15.1.0, released on October 21st, 2023.

Bugs

Specification of Unicode characters on the command line requires an operating system and shell which support that feature and a version of Perl with the -CA command line option (v5.8.5 has it, but v5.8.0 does not; I don't know in which intermediate release it was introduced). If your version of Perl does not implement this switch, you'll have to remove it from the #! statement at the top of the program, and Unicode characters on the command line will not be interpreted correctly.

If you specify a regular expression, be sure to quote the argument if it contains any characters the shell would otherwise interpret.

If you run perldoc on the compressed version of the program, a large amount of gibberish will be displayed after the end of the embedded documentation. perldoc gets confused by sequences in the compressed data table and tries to interpret it as documentation. This doesn't happen with the uncompressed version.

Please report any bugs to bugs@fourmilab.ch.

Copyright

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The Unicode character tables are based upon the Unicode 15.1.0 (September 2023) standard.

The control characters in this unum version have been annotated with their Unicode abbreviations, names, and for U+0000 to U+001F, the Ctrl-key code which generates them.

The HTML named character references are from the World Wide Web Consortium HTML standard. Some browsers may not support all of these references.
