Development Details


Source document

I started with a flat ASCII text version of the Internal Revenue Code obtained from the U.S. House of Representatives Office of the Law Revision Counsel. It is current through January 19th, 2004, which corresponds to Supplement III of the 2000 edition of the U.S. Code.

Compilation into HTML for the Web

I wrote an incredibly messy and inefficient C program to compile the text into XHTML 1.0 and to create the master table of contents and section index. Code sections that contain statute text are parsed to extract the section hierarchy, and cross-references between sections are linked wherever possible, so that clicking on a cross-reference displays the cited text.
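To make the hierarchy extraction concrete, here is a minimal sketch in C of one small piece of it: reading the run of parenthesized designators that opens a line of statute text, such as "(a)(1)(A)" for subsection, paragraph, and subparagraph. This is not the actual compiler. The function names, the MAX_DESIG_LEN limit, and the underscore-joined anchor naming scheme are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DESIG_LEN 8   /* hypothetical limit, including the NUL */

/*
 * Parse the run of parenthesized designators at the start of `line`,
 * e.g. "(a)(1)(A) In general" yields "a", "1", "A".
 * Returns the number of designators stored (at most `max`) and sets
 * `*rest` to the first character after the parsed run.
 */
size_t parse_designators(const char *line, char desig[][MAX_DESIG_LEN],
                         size_t max, const char **rest)
{
    size_t count = 0;

    assert(line != NULL && rest != NULL);
    while (count < max && *line == '(') {
        const char *p = line + 1;
        size_t len = 0;

        while (isalnum((unsigned char)p[len]))
            len++;
        /* Stop at anything that is not a short "(alnum)" group. */
        if (len == 0 || len >= MAX_DESIG_LEN || p[len] != ')')
            break;
        memcpy(desig[count], p, len);
        desig[count][len] = '\0';
        count++;
        line = p + len + 1;
    }
    *rest = line;
    return count;
}

/*
 * Join designators into an anchor name such as "a_1_A"
 * (an invented naming scheme). Returns 0 on success, -1 if `out`
 * is too small.
 */
int make_anchor(char desig[][MAX_DESIG_LEN], size_t n,
                char *out, size_t outsz)
{
    size_t used = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(desig[i]);
        size_t need = len + (i > 0 ? 1 : 0);

        if (used + need + 1 > outsz)
            return -1;
        if (i > 0)
            out[used++] = '_';
        memcpy(out + used, desig[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

A sketch like this deliberately ignores the hard cases. It cannot tell whether "(i)" is a subsection letter or a lowercase roman-numeral clause, and it would accept an ordinary short parenthetical as a designator. Resolving those cases needs the surrounding context.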

Compilation is a three-pass process. The first pass reads the ASCII document, extracts the section hierarchy, assigns HTML file names, and stores the text in memory by component. The second pass reads the statute text to identify its hierarchy within each section, assigns names to anchors (link targets), and builds a table of anchors. The final pass creates the actual HTML file for each section, scanning the text for cross-references and inserting links to the anchors assigned in the second pass. The program logic in the second and third passes is supplemented by a "hints" file identifying more than 700 occurrences of improper formatting in the Code, which allows the program to correctly parse text it would otherwise misinterpret.
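The relationship between the anchor table built in the second pass and the link insertion done in the final pass can be sketched roughly as follows. This is an illustrative reduction, not the real program: the struct anchor layout, the function names, and the rule that a citation is the literal word "section " followed by a number with optional capital-letter suffix are all assumptions. Citations with no entry in the table are copied through unlinked, which mirrors the "wherever possible" behavior described earlier.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical anchor table entry: section number -> link target. */
struct anchor {
    const char *section;   /* e.g. "162" or "280G" */
    const char *target;    /* e.g. "sec162.html" (made-up file name) */
};

/* Linear search of the table; returns the target or NULL. */
static const char *find_anchor(const struct anchor *tab, size_t ntab,
                               const char *sec, size_t len)
{
    for (size_t i = 0; i < ntab; i++) {
        if (strlen(tab[i].section) == len &&
            strncmp(tab[i].section, sec, len) == 0)
            return tab[i].target;
    }
    return NULL;
}

/* Append `len` bytes of `s` to `out`; returns -1 if they do not fit. */
static int append(char *out, size_t outsz, size_t *used,
                  const char *s, size_t len)
{
    if (*used + len + 1 > outsz)
        return -1;
    memcpy(out + *used, s, len);
    *used += len;
    out[*used] = '\0';
    return 0;
}

/*
 * Copy `text` to `out`, wrapping each "section N" citation in a link
 * when N appears in the anchor table. Returns the number of links
 * inserted, or -1 if `out` is too small.
 */
int link_xrefs(const char *text, const struct anchor *tab, size_t ntab,
               char *out, size_t outsz)
{
    static const char kw[] = "section ";
    const size_t kwlen = sizeof kw - 1;
    size_t used = 0;
    int links = 0;

    assert(text != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    while (*text != '\0') {
        size_t take = 1;             /* default: copy one character */
        const char *target = NULL;

        if (strncmp(text, kw, kwlen) == 0 &&
            isdigit((unsigned char)text[kwlen])) {
            const char *num = text + kwlen;
            size_t numlen = 0;

            while (isdigit((unsigned char)num[numlen]))
                numlen++;
            while (isupper((unsigned char)num[numlen]))
                numlen++;            /* suffix letters, as in 280G */
            target = find_anchor(tab, ntab, num, numlen);
            take = kwlen + numlen;
        }

        if (target != NULL) {
            if (append(out, outsz, &used, "<a href=\"", 9) != 0 ||
                append(out, outsz, &used, target, strlen(target)) != 0 ||
                append(out, outsz, &used, "\">", 2) != 0 ||
                append(out, outsz, &used, text, take) != 0 ||
                append(out, outsz, &used, "</a>", 4) != 0)
                return -1;
            links++;
        } else if (append(out, outsz, &used, text, take) != 0) {
            return -1;
        }
        text += take;
    }
    return links;
}
```

Building the whole table before emitting any HTML is what makes a separate pass necessary: a section can cite one that appears later in the document, so the target of every link has to be known before the first file is written.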

Building the full-text search database

A modified version of the compiler program was used to create a second copy of the Code, which was then indexed with freeWAIS-sf. In this copy, the title of each section contains the name of the corresponding compiled HTML document. A modified version of Ulrich Pfeifer's SFgate interface between the Web and WAIS uses the Common Gateway Interface (CGI) to submit queries to the freeWAIS-sf search server and constructs the reply document from the items returned. Text retrieval is not done through WAIS; WAIS serves solely as a search tool that points to Web documents.
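The idea of a search hit that merely points at a Web document can be illustrated with a small sketch in C. It turns one returned title into a list item linking to the compiled page. This is not SFgate's code, and the title layout assumed here (the HTML file name, one space, then the human-readable title) is an invented convention, as are the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append `len` bytes of `s` to `out`; returns -1 if they do not fit. */
static int put(char *out, size_t outsz, size_t *used,
               const char *s, size_t len)
{
    if (*used + len + 1 > outsz)
        return -1;
    memcpy(out + *used, s, len);
    *used += len;
    out[*used] = '\0';
    return 0;
}

/* Append `s` with the HTML special characters &, < and > escaped. */
static int put_escaped(char *out, size_t outsz, size_t *used,
                       const char *s)
{
    for (; *s != '\0'; s++) {
        const char *rep = NULL;
        int rc;

        switch (*s) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  break;
        }
        rc = rep ? put(out, outsz, used, rep, strlen(rep))
                 : put(out, outsz, used, s, 1);
        if (rc != 0)
            return -1;
    }
    return 0;
}

/*
 * Convert a hit title of the assumed form "FILE TITLE" into
 * <li><a href="FILE">TITLE</a></li>. The file name is assumed to
 * contain only URL-safe characters. Returns 0 on success, -1 if the
 * title has no file/title split or `out` is too small.
 */
int hit_to_link(const char *headline, char *out, size_t outsz)
{
    static const char open[] = "<li><a href=\"";
    static const char mid[] = "\">";
    static const char close[] = "</a></li>";
    size_t used = 0;
    const char *sp;

    assert(headline != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    sp = strchr(headline, ' ');
    if (sp == NULL || sp == headline)
        return -1;

    if (put(out, outsz, &used, open, sizeof open - 1) != 0 ||
        put(out, outsz, &used, headline, (size_t)(sp - headline)) != 0 ||
        put(out, outsz, &used, mid, sizeof mid - 1) != 0 ||
        put_escaped(out, outsz, &used, sp + 1) != 0 ||
        put(out, outsz, &used, close, sizeof close - 1) != 0)
        return -1;
    return 0;
}
```

Because each hit carries only a file name and a title, the reply page needs nothing from the search server beyond the hit list. The statute text itself is always served as an ordinary Web document.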

by John Walker