# Interconvert numbers, Unicode, and HTML/XHTML entities

Once upon a time, all the characters you needed when writing documents and programs were right there on the keyboard (at least if you're too young to have used a keypunch and think 11−6−8 every time you see a semicolon), but that was then and this is the globalised, technology-saturated twenty-first century, with flying cars, fusion power generators, holidays on the Moon, and instantaneous worldwide communication…oh, wait…well, the last one anyway. When writing for a wired world, one often finds the need to use characters from languages other than one's own, mathematical and scientific symbols, fancy punctuation, printer's ornaments, and other embellishments which give online documents that professional polish.

Usually, this has meant hauling out that monster (1500 page, 3.4 kg) Unicode book, searching for the character that's required, then converting its hexadecimal character code to a decimal or symbolic HTML “character entity”, which means firing up the programmer's calculator and/or another table lookup. Enter unum, a stand-alone utility program written in portable Perl which allows you to look up Unicode and HTML characters by name or number, and interconvert numbers in decimal, hexadecimal, and octal bases.

Here are some everyday questions you can easily answer with unum.

Browser and system font support for Unicode is spotty at the present time. Some of the characters in the following examples may display as square boxes, blanks, question marks, or whatever other symbol your system uses for characters it cannot properly display.

    $ unum 65261
       Octal  Decimal      Hex        HTML    Character   Unicode
     0177355    65261   0xFEED    &#65261;    "ﻭ"         ARABIC LETTER WAW ISOLATED FORM

A Perl string contains "\x{2622}". What character is that?

    $ unum 0x2622
       Octal  Decimal      Hex        HTML    Character   Unicode
      023042     9762   0x2622     &#9762;    "☢"         RADIOACTIVE SIGN

What's the character code for Control-Q?

    $ unum n=ctrl-q
       Octal  Decimal      Hex        HTML    Character   Unicode
         021       17     0x11       &#17;    "\x{11}"    <control> DC1 DEVICE CONTROL ONE Ctrl-Q

What are the HTML entity codes for mathematical integral signs?

    $ unum n=integral
       Octal  Decimal      Hex        HTML    Character   Unicode
      021053     8747   0x222B       &int;    "∫"         INTEGRAL
      021054     8748   0x222C     &#8748;    "∬"         DOUBLE INTEGRAL
      021055     8749   0x222D     &#8749;    "∭"         TRIPLE INTEGRAL
      021056     8750   0x222E     &#8750;    "∮"         CONTOUR INTEGRAL

Additional output elided.

What HTML entities are defined for quotation marks?

    $ unum h=quo
       Octal  Decimal      Hex        HTML    Character   Unicode
         042       34     0x22      &quot;    """         QUOTATION MARK
        0253      171     0xAB     &laquo;    "«"         LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        0273      187     0xBB     &raquo;    "»"         RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
      020030     8216   0x2018     &lsquo;    "‘"         LEFT SINGLE QUOTATION MARK
      020031     8217   0x2019     &rsquo;    "’"         RIGHT SINGLE QUOTATION MARK

Additional output elided.

Which Unicode blocks contain Greek characters?

    $ unum b=greek
       Start        End  Unicode Block
      U+0370 -   U+03FF  Greek and Coptic
      U+1F00 -   U+1FFF  Greek Extended
     U+10140 -  U+1018F  Ancient Greek Numbers
     U+1D200 -  U+1D24F  Ancient Greek Musical Notation

What are all the character blocks in Unicode?

    $ unum b=.
       Start        End  Unicode Block
      U+0000 -   U+007F  Basic Latin
      U+0080 -   U+00FF  Latin-1 Supplement
      U+0100 -   U+017F  Latin Extended-A
      U+0180 -   U+024F  Latin Extended-B
      U+0250 -   U+02AF  IPA Extensions

Output elided.

     U+2F800 -  U+2FA1F  CJK Compatibility Ideographs Supplement
     U+E0000 -  U+E007F  Tags
     U+E0100 -  U+E01EF  Variation Selectors Supplement
     U+F0000 -  U+FFFFF  Supplementary Private Use Area-A
    U+100000 - U+10FFFF  Supplementary Private Use Area-B

What characters are available for the Thai language?

    $ unum l=thai
       Start        End  Unicode Block
      U+0E00 -   U+0E7F  Thai

       Octal  Decimal      Hex        HTML    Character   Unicode
       07000     3584    0xE00     &#3584;    "฀"         Thai U+00E00
       07001     3585    0xE01     &#3585;    "ก"         THAI CHARACTER KO KAI
       07002     3586    0xE02     &#3586;    "ข"         THAI CHARACTER KHO KHAI
       07003     3587    0xE03     &#3587;    "ฃ"         THAI CHARACTER KHO KHUAT
       07004     3588    0xE04     &#3588;    "ค"         THAI CHARACTER KHO KHWAI

Additional output elided.

Here's the Russian word “Правда” in a Web page. What's the HTML code to use it in my own page?

    $ unum c=Правда
       Octal  Decimal      Hex        HTML    Character   Unicode
       02037     1055    0x41F     &#1055;    "П"         CYRILLIC CAPITAL LETTER PE
       02100     1088    0x440     &#1088;    "р"         CYRILLIC SMALL LETTER ER
       02060     1072    0x430     &#1072;    "а"         CYRILLIC SMALL LETTER A
       02062     1074    0x432     &#1074;    "в"         CYRILLIC SMALL LETTER VE
       02064     1076    0x434     &#1076;    "д"         CYRILLIC SMALL LETTER DE
       02060     1072    0x430     &#1072;    "а"         CYRILLIC SMALL LETTER A

A Web page contains a sequence of character entities for an Arabic name. What is it?

    $ unum '&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;'
       Octal  Decimal      Hex        HTML    Character   Unicode
       03047     1575    0x627     &#1575;    "ا"         ARABIC LETTER ALEF
       03104     1604    0x644     &#1604;    "ل"         ARABIC LETTER LAM
       03071     1593    0x639     &#1593;    "ع"         ARABIC LETTER AIN
       03061     1585    0x631     &#1585;    "ر"         ARABIC LETTER REH
       03050     1576    0x628     &#1576;    "ب"         ARABIC LETTER BEH
       03112     1610    0x64A     &#1610;    "ي"         ARABIC LETTER YEH
       03051     1577    0x629     &#1577;    "ة"         ARABIC LETTER TEH MARBUTA

“al-'Arabiyah” (Arabia)
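The last two examples are inverse operations: one turns text into numeric character references, the other turns references back into text. The following is a minimal Python sketch of that round trip, not unum's own Perl code; the function names `to_entities` and `from_entities` are invented for illustration, and only the lower-case `x` form of hexadecimal references is accepted.

```python
import re

def to_entities(text):
    """Encode every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

# Decimal (&#1055;) or hexadecimal (&#x41F;) reference; the "x" must be lower case.
_ENTITY = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

def from_entities(markup):
    """Replace each numeric character reference with the character it names."""
    def decode(match):
        body = match.group(1)
        code = int(body[1:], 16) if body.startswith("x") else int(body)
        return chr(code)
    return _ENTITY.sub(decode, markup)

print(to_entities("Правда"))
print(from_entities("&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;"))
```

Text that does not match the reference pattern, such as a named entity like `&amp;`, passes through `from_entities` unchanged.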

unum.tar.gz: Gzipped TAR archive (112 Kb)

The archive contains a single file, unum.pl, which is the self-contained Perl source code for unum. You may wish to rename it unum and install it in a directory on your PATH, but it may be run from any location. The program uses no modules from the Perl library: there are no prerequisites other than Perl itself.

For full functionality, including the ability to display Unicode characters and accept them on the command line, unum requires a release of Perl contemporary with its early 2006 release date. The program has been tested on Perl 5.8.5; if you run it on an earlier version, you may get an error indicating that the `-CA` option in the first line of the program is not implemented. If this option is removed, the program will still work, but you won't be able to specify Unicode characters on the command line.

Unicode command line arguments and terminal output require an operating system and shell which support these features and may not work on your system: for example, everything works fine with Perl 5.8.5 on Fedora Core 3 Linux, but neither Unicode arguments nor Unicode output from shell programs are implemented on Red Hat Enterprise Linux 3. You may have to change the character encoding in your terminal program to permit Unicode input and output; with Gnome Terminal, for example, select the Terminal / Character Encoding / Unicode menu item. The character name and number lookup facilities will work on any system with a vaguely recent version of Perl.

All prior releases remain available.

## Manual Page

You can print the following documentation directly from the unum.pl program with the command "perldoc unum.pl". Calling the program with no arguments will print a short summary of argument formats.

## Name

unum — Interconvert numbers, Unicode, and HTML/XHTML characters

## Synopsis

unum argument

## Description

The unum program is a command line utility which allows you to convert decimal, octal, hexadecimal, and binary numbers; Unicode character and block names; and HTML/XHTML character entity names into one another. It can be used as an on-line special character reference for Web authors.

### Arguments

The command line may contain any number of the following forms of argument.

- `123`: Decimal number.
- `0371`: Octal number preceded by a zero.
- `0x1D351`: Hexadecimal number preceded by `0x`. Letters may be upper or lower case, but the `x` must be lower case.
- `0b110101`: Binary number.
- `b=block`: Unicode character blocks matching block are listed. The block specification may be a regular expression. For example, `b=greek` lists all Greek character blocks in Unicode.
- `c=char`…: The Unicode character codes for the characters char… are printed. If the first character is not a decimal digit and the second not an equal sign, the `c=` may be omitted.
- `h=entity`: List all HTML/XHTML character entities matching entity, which may be a regular expression. Matching is case-insensitive, so `h=alpha` finds both `&Alpha;` and `&alpha;`.
- `'&#number;&#xhexnum;…'`: List the characters corresponding to the specified HTML/XHTML character entities, which may be given in either decimal or hexadecimal. Note that the “x” in XHTML entities must be lower case. On most Unix-like operating systems, you'll need to quote the argument so the ampersand, octothorpe, and semicolon aren't interpreted by the shell.
- `l=block`: List all Unicode blocks matching block and all characters within each block; `l=goth` lists the `Gothic` block and the 32 characters it contains.
- `n=name`: List all Unicode characters whose names match name, which may be a regular expression. For example, `n=telephone` finds the five Unicode characters for telephone symbols.
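To make the argument forms above concrete, here is a rough Python sketch of how a command line argument could be sorted into them. It is an illustration only and does not reproduce unum's Perl source: the names `parse_number` and `classify` are hypothetical, the order in which the forms are tried is an assumption, and so is the choice to read a malformed octal such as `09` as decimal.

```python
import re

def parse_number(arg):
    """Return the integer value of a numeric argument, or None if it is not one."""
    if re.fullmatch(r"0x[0-9A-Fa-f]+", arg):   # hexadecimal: the "x" must be lower case
        return int(arg[2:], 16)
    if re.fullmatch(r"0b[01]+", arg):          # binary
        return int(arg[2:], 2)
    if re.fullmatch(r"0[0-7]+", arg):          # octal: leading zero
        return int(arg, 8)
    if re.fullmatch(r"[0-9]+", arg):           # decimal (assumption: "09" lands here)
        return int(arg, 10)
    return None

def classify(arg):
    """Map an argument to a (kind, value) pair following the forms listed above."""
    number = parse_number(arg)
    if number is not None:
        return ("number", number)
    if len(arg) > 1 and arg[1] == "=" and arg[0] in "bchln":
        return (arg[0], arg[2:])               # b=, c=, h=, l=, n=
    if re.fullmatch(r"(&#(x[0-9A-Fa-f]+|[0-9]+);)+", arg):
        return ("entities", arg)               # '&#number;&#xhexnum;...'
    return ("c", arg)                          # bare characters: c= omitted

for example in ["123", "0371", "0x1D351", "0b110101", "b=greek", "Правда"]:
    print(example, classify(example))
```

Checking for numbers first mirrors the rule that `c=` may only be omitted when the argument does not start with a decimal digit.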

### Output

For number or character arguments, the value(s) are listed in all of the input formats, save binary.

```
   Octal  Decimal      Hex        HTML    Character   Unicode
     046       38     0x26       &amp;    "&"         AMPERSAND
```
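The row above can be approximated from a code point alone. This Python sketch is a reconstruction under stated assumptions, not the program's own logic: `row` is a hypothetical helper, the column widths are inferred from the sample output, the entity names come from Python's HTML 4 table in `html.entities`, and character names come from `unicodedata`, which has no names for control characters (hence the placeholder).

```python
import unicodedata
from html.entities import codepoint2name

def row(code):
    """Format one code point as octal, decimal, hex, HTML entity, character, name."""
    octal = f"0{code:o}"                       # leading zero marks octal, as in 046
    hexa = f"0x{code:X}"
    name = codepoint2name.get(code)            # named entity if HTML 4 defines one
    entity = f"&{name};" if name else f"&#{code};"
    ch = chr(code)
    # Unprintable characters are shown as a Perl-style hexadecimal escape.
    shown = f'"{ch}"' if ch.isprintable() else f'"\\x{{{code:X}}}"'
    uname = unicodedata.name(ch, "<no name in unicodedata>")
    return f"{octal:>8}{code:>9}{hexa:>9}{entity:>12}    {shown:<12}{uname}"

print("   Octal  Decimal      Hex        HTML    Character   Unicode")
print(row(38))
print(row(0x2622))
```

Binary is deliberately absent from the row, matching the statement that values are listed in all of the input formats save binary.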

If the terminal font cannot display the character being listed, the "Character" field will contain whatever default is shown in such circumstances. Control characters are shown as a Perl hexadecimal escape.

Unicode blocks are listed as follows:

```
   Start        End  Unicode Block
  U+2460 -   U+24FF  Enclosed Alphanumerics
 U+1D400 -  U+1D7FF  Mathematical Alphanumeric Symbols
```
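A block query such as `b=greek` amounts to a regular expression search over block names. The sketch below shows the idea in Python with a small table copied from the examples in this document; the real program carries the complete block list. The names `BLOCKS`, `find_blocks`, and `format_block` are hypothetical, and case-insensitive matching is inferred from the fact that `b=greek` finds "Greek and Coptic".

```python
import re

# A few blocks taken from the examples in this document, not the full table.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x1F00, 0x1FFF, "Greek Extended"),
    (0x2460, 0x24FF, "Enclosed Alphanumerics"),
    (0x10140, 0x1018F, "Ancient Greek Numbers"),
    (0x1D200, 0x1D24F, "Ancient Greek Musical Notation"),
    (0x1D400, 0x1D7FF, "Mathematical Alphanumeric Symbols"),
]

def find_blocks(pattern):
    """Return the blocks whose names match the regular expression, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [block for block in BLOCKS if rx.search(block[2])]

def format_block(start, end, name):
    """Lay out one block in the Start / End / Unicode Block columns."""
    first = f"U+{start:04X}"
    last = f"U+{end:04X}"
    return f"{first:>8} - {last:>8}  {name}"

print("   Start        End  Unicode Block")
for block in find_blocks("greek"):
    print(format_block(*block))
```

Because the pattern is a regular expression, `find_blocks(".")` returns every entry in the table, which is how `b=.` lists all blocks.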

## Version

This is unum version 1.1, released on February 11th, 2006.

## Bugs

Specification of Unicode characters on the command line requires an operating system and shell which support that feature and a version of Perl with the `-CA` command line option (v5.8.5 has it, but v5.8.0 does not; I don't know in which intermediate release it was introduced). If your version of Perl does not implement this switch, you'll have to remove it from the `#!` statement at the top of the program, and Unicode characters on the command line will not be interpreted correctly.

If you specify a regular expression, be sure to quote the argument if it contains any characters the shell would otherwise interpret.

Please report any bugs to bugs@fourmilab.ch.

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The Unicode character tables are derived from the Unicode::CharName module:

The Unicode::CharName library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

```    http://search.cpan.org/~gaas/Unicode-String-2.09/lib/Unicode/CharName.pm