by John Walker
Once you have a data network which interconnects computers, there is scarcely anything more obvious to do with it than download and upload files—take data there and copy it here and vice versa. It's not surprising, then, that one of the first protocols defined in the infancy of the ARPANET, ancestor of the Internet, was the original File Transfer Protocol, promulgated in RFC 114 on April 16th, 1971 (the current version of this protocol is defined in RFC 959 of October 1985). More than thirty-five years later, downloading (and to a lesser extent uploading) files remains one of the essential applications on the now-global Internet; users require a way to reliably transfer files among hosts on the network, regardless of their size, content, and the details of the machines at either end of the network connection.
Given how essential this capability is to the experience of the user on the Internet, and how simple it is to get right, one is not surprised that Microsoft have completely bungled its implementation in their attempt at a Web browser: Microsoft Internet Explorer 6. Managers of Web sites have become accustomed to complaints by users who download files from their sites with this piece of…software that the files they receive are “corrupted”, with users often adding that they attempted the download several times and received corrupted results on each occasion. As is usually the case, the “corruption” about which the user is complaining is not to be found on the Web site, nor in the Internet infrastructure over which the file was transferred, but in Redmond, Washington, where Microsoft develops a Web browser which fails regularly at the most simple-minded and straightforward of network applications: downloading a file.
Let's take a look at the details of one such report. I have investigated several such incidents over the last year, and they all share the characteristics of the following example.
The user reporting the “corrupted files”, in this case Home Planet distribution archives, is running on Windows 2000 with Microsoft Internet Explorer 6, both at current service pack and patch levels. No error message was reported by Internet Explorer for the download of the files, yet WinZip reported the files as invalid archives. I requested the user send me the files he had downloaded as attachments to an E-mail message so that I could compare them with the files on the server (which, I had verified, were not corrupted in any way). When I received the files and examined them, both turned out to be truncated copies of the files on my server. In each of the two files, the portion received by the user was identical to the initial part of the file on the server, but the file received was incomplete: in one case 3027188 instead of 7114310 bytes, and in the other just 192228 of 1423984 bytes on the server. To guard against some problem delivering the mail attachments, I had asked the user to report the length of the files on his machine, and this agreed with the files I received.
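The comparison described above—checking whether the file the user received is a byte-for-byte initial portion of the file on the server—can be automated rather than done by eye. A minimal sketch in Python (the function name and file paths are mine, purely for illustration):

```python
import os

def is_truncated_copy(partial_path, full_path, chunk=65536):
    """Return True if the file at partial_path is a proper
    byte-for-byte prefix of the file at full_path, which is
    the signature of a truncated download."""
    psize = os.path.getsize(partial_path)
    fsize = os.path.getsize(full_path)
    if psize >= fsize:
        return False          # same length or longer: not truncated
    with open(partial_path, "rb") as p, open(full_path, "rb") as f:
        while True:
            a = p.read(chunk)
            if not a:
                return True   # end of partial file, every byte matched
            if a != f.read(len(a)):
                return False  # contents diverge: garbling, not truncation
```

Run against the 3027188-byte file the user received and the 7114310-byte original, a check like this would report a clean truncation rather than random corruption.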
The entries from the Apache HTTP server log show these downloads as having completed normally, with the full file length transferred:
200.x.x.18 - - [06/Aug/2006:14:21:49 +0200]
    "GET /homeplanet/download/3.1/hp3full.zip HTTP/1.1" 200 7114310
    "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

200.x.x.18 - - [09/Aug/2006:16:19:41 +0200]
    "GET /homeplanet/download/3.1/hp3lite.zip HTTP/1.1" 200 1423984
    "http://www.fourmilab.ch/homeplanet/"
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
(I have manually wrapped these long lines onto multiple lines to avoid truncation, and obscured the second and third bytes of the user's IP address in the interest of privacy.) In fact, the log showed three attempts to download the 7.1 MB hp3full.zip file, all apparently complete and successful as seen from the server side, yet the file received by the user was incomplete.
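If you want to check your own server logs for the same symptom, the two numbers that matter in each entry are the HTTP status code and the byte count which follows the quoted request string. A quick way to pull them out of an Apache common/combined format log line (the regular expression assumes the standard format; adjust it if your LogFormat differs):

```python
import re

# Status code and byte count follow the quoted request string;
# Apache logs "-" for the byte count when nothing was sent.
LOG_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3}) (\d+|-)')

def status_and_bytes(line):
    """Extract (status, bytes_sent) from an Apache log line,
    or return None if the line doesn't match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1))
    sent = None if m.group(2) == "-" else int(m.group(2))
    return status, sent

line = ('200.x.x.18 - - [06/Aug/2006:14:21:49 +0200] '
        '"GET /homeplanet/download/3.1/hp3full.zip HTTP/1.1" 200 7114310 '
        '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"')
```

For the first log entry above, this yields a status of 200 and a byte count of 7114310—the full file length, even though the user received only part of it.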
So what we have is a situation in which a user attempts to download a file with Internet Explorer, which contacts the remote site, starts the download, and then gives up part of the way through the download, storing a truncated initial portion of the complete file in the user's destination directory and the browser cache, while providing the user no indication whatsoever that anything is amiss. Neither does the browser's interaction with the Web site indicate any problem has occurred: the Web transfer log invariably shows a complete download with the entire file transferred and a normal completion status.
Even worse, Internet Explorer now has a copy of the truncated download in its browser cache (which it calls “Temporary Internet files”), and if the user attempts to download the file again in the hope that the corruption is a random event which won't repeat, Explorer will act like it's downloading the file, but actually just copy the truncated version from the cache. Hence, unless the user manually clears the cache (instructions for which are given below), he can go on retrying the download until the cows come home, the Sun rises in the west, and Microsoft starts shipping quality products, with the same truncated file resulting from each and every download attempt.
By far the wisest course for the intelligent Internaut is to abandon Microsoft Internet Explorer entirely, leaving its multitudinous security holes, truncated downloads, cache pollution, and eccentric rendering of standards-compliant pages behind, and upgrade to a secure, compatible, and reliable browser such as:
both of which are free. You will still have to use Internet Explorer to run Windows Update, which relies on Explorer-only (and perilous) “ActiveX” components, but after a few hours getting used to it, you'll almost certainly find that one of the competently implemented browsers listed above will improve your Web experience in numerous ways in addition to solving the problem of corrupted downloads.
If, for some reason, you don't want to migrate to a better browser, a lighter-footprint solution to the corrupted download problem is available through the following command-line download programs:
These programs allow you to download files from servers using the HTTP, secure HTTPS, and FTP protocols, with a simple command entered at the Windows command prompt like:
wget http://www.fourmilab.ch/homeplanet/download/3.3a/hp3full.zip
You can avoid having to type in the URL for the document by right-clicking the link in your Web browser and choosing a menu item like “Copy Shortcut” or “Copy Link Location” to copy the URL to the clipboard, whence you can paste it into the command prompt window. Not only are these utilities immune to the problem of silently truncated downloads, but if something does go wrong during a download, both are able to resume it at the point it was interrupted, without re-transferring the portion of the file you have already received; this can be a life-saver when downloading large files over slow dial-up connections which are prone to dropping out at random times.
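The resumption these tools offer is built on the HTTP Range request header: the client measures how much of the file it already has on disk and asks the server to send only the bytes from that offset onward (the server replies with status 206, Partial Content). A sketch of the mechanism—the helper names here are mine, not part of either tool:

```python
import os

def resume_offset(path):
    """Byte offset to resume from: the size of the partial
    file already on disk, or 0 if no partial file exists."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def range_header(resume_from):
    """Build the HTTP request header asking the server to send
    the file starting at the given byte offset."""
    return {"Range": "bytes=%d-" % resume_from}
```

Invoking wget with its -c (continue) option, or curl with -C -, does essentially this: stat the partial file, send the Range header, and append the server's response to what is already on disk.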
Some Web sites (although dismayingly few) make their lengthy downloads accessible through the Internet File Transfer Protocol (FTP) as well as the Web's HTTP protocol. As a protocol with more than 35 years of experience under its belt, FTP is highly robust, although it has accreted so many options and quirks over the years that it appeals more to experts than newcomers to the Internet. Every file at Fourmilab is available via FTP as well as HTTP—there is a simple rule for rewriting an HTTP URL to transform it into one which will access the same file with FTP: see the Fourmilab FTP Tutorial for details. (This technique will work only for files at Fourmilab or other sites which are configured identically. The vast majority of sites these days do not provide FTP access to all of their content, and if they do provide FTP downloads, tend to use URLs which cannot be easily deduced from the corresponding HTTP links.)
If you're using one of the competently implemented Web browsers recommended above or one of the command-line download tools, downloading a file with FTP is simply a matter of specifying the FTP URL instead of one for HTTP. For example, to download the same file used in the wget example above with FTP, you would use:
wget ftp://ftp.fourmilab.ch/web/homeplanet/download/3.3a/hp3full.zip
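Judging from the two commands above, the Fourmilab rewrite rule amounts to swapping the HTTP prefix for the FTP one and inserting “/web”. A sketch of that rule as a function—deduced from the two example URLs rather than quoted from the Fourmilab FTP Tutorial, so check the tutorial for the authoritative statement:

```python
HTTP_PREFIX = "http://www.fourmilab.ch/"
FTP_PREFIX = "ftp://ftp.fourmilab.ch/web/"

def fourmilab_ftp_url(http_url):
    """Rewrite a Fourmilab HTTP URL into the FTP URL which
    accesses the same file, per the prefix swap above."""
    if not http_url.startswith(HTTP_PREFIX):
        raise ValueError("not a Fourmilab HTTP URL: " + http_url)
    return FTP_PREFIX + http_url[len(HTTP_PREFIX):]
```

As the parenthetical note above warns, this prefix swap is specific to Fourmilab (and sites configured identically); it will not produce working FTP URLs for other sites.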
But since these programs download files reliably with HTTP, there isn't much point in using FTP with them, even if it is available. The real advantage of FTP is that almost every computer and operating system which supports connection to a network comes equipped with a command-line FTP client which, in theory, allows you to download files with the FTP protocol without installing any software at all. Command-line FTP clients have a totally retro user interface, steeped in decades of tradition, but with a little experience, using one will become second nature: the Fourmilab FTP Tutorial explains the process in excruciating detail. Modern FTP clients support “passive mode” transfers, which permit them to work on machines behind firewalls and Network Address Translation (NAT) boxes such as many DSL routers, and support the reget command, which permits resuming an interrupted download from the point at which it failed.
Unfortunately, as is so often the case when discussing things Microsoft, there is A Catch, namely that one rarely, if ever, encounters the word “modern” in juxtaposition to “Microsoft”. The command-line FTP client supplied with Microsoft Windows, all the way from Windows 95 through present-day Windows XP, is a laughably antiquated dinosaur which lacks support for both passive mode and reget, and hence cannot be used at all by users whose firewalls and/or NAT boxes are not specially configured to permit the original “active mode” protocol. Consequently, command-line FTP with Microsoft's client program may or may not work for you, depending upon the details of your Internet connection. If you're certain you have a direct connection which supports active mode FTP or you're feeling adventurous and want to see if it will work for you, go ahead and try it out following the instructions in the tutorial, but don't be surprised if it doesn't work, and keep in mind that while this may be a solution for downloads from Fourmilab, it won't work for downloads from many other sites.
(Morons who work for Microsoft may tell you that their FTP client does support passive mode FTP, and that “all you have to do is enter the command ‘quote PASV’ before starting the transfer”. This is untrue. This command, which simply sends a “PASV” command to the remote server, will indeed put the server into passive mode, and you'll receive a nice confirmation of this event. But Microsoft's brain-damaged FTP client still doesn't understand passive mode, so it will attempt to make the file transfer in active mode, which won't work.)
If you find FTP downloads a useful part of your Internet arsenal, you may want to install a competently implemented FTP client on your Windows machine. One with which I have had good results is:
which has a graphical user interface as well as a command-line interface and is free for non-commercial use. A completely free command-line FTP client which runs on Microsoft Windows as well as almost every other computer and operating system is:
which has been available since 1991 and supports passive mode and resuming interrupted downloads as well as numerous other features including background transfers. For basic operations it works almost exactly like the Microsoft FTP client, except that it pops up its command line in a separate window, and doesn't therefore need to be run from a command prompt: it can be launched from the Start menu.
Of course, in order to use any of these solutions to the download problem (apart from the built-in FTP client), you have to first download and install them, which might seem to recede into infinite regress. Fortunately, the incidence of truncated downloads with Internet Explorer is sufficiently rare that you can probably get away with downloading and installing one or more of the packages mentioned above without any problems; just be aware that repeatedly dodging the bullet doesn't make it any less likely you're going to be nailed the next time.
When you do suffer a download corrupted by Internet Explorer and you wish to retry it in the hope it will work the next time (or the time after that, or…), it is absolutely essential that you first clear the browser cache to get rid of the incomplete download. If you fail to do this, Internet Explorer will, while going through the motions to persuade you that it is downloading the file from the Web site, actually just copy the version saved in the cache, handing you another identically truncated copy. To clear the browser cache in Internet Explorer 6 (heaven knows where they'll hide it in future versions, all in the interest of improving your “browsing experience”), go to the “Tools” menu and select “Internet Options”, which will display a tabbed dialogue; choose the “General” tab. In the middle of this page is a section titled “Temporary Internet files”, which is Microsoftlish for what everybody else calls the browser cache. Click the “Delete Files” button, and when the squawk box pops up, confirm that you really, really want to discard these copies of pages you've viewed, including the truncated download. Click OK to close the “Internet Options” dialogue, and you're now ready to retry the download. Actually, after clearing the cache, I would quit and restart Internet Explorer in case it's got something squirrelled away in its pointed little mushy-brained head remembering the bad download, but this is probably just superstition. Still, it can't hurt.
If you've downloaded a file from Fourmilab and you want to be sure it is complete and identical to the copy on the server, you can use the following form to verify the URL on the Fourmilab site and report its length and its signatures under two commonly used algorithms. Enter the complete URL in the text field (you can use either HTTP or FTP protocol URLs), and click the “Verify” button. If the URL describes a valid file on the server, its length in bytes and its MD5 and SHA1 signatures will be reported. Usually, if the length of the file you downloaded is the same as that of the file on the server, it has been downloaded without errors. To be extra careful, you may wish to compute the signatures of the file using an MD5 or SHA1 utility and check them against the values reported in the verification page (utilities differ as to whether letters in these hexadecimal values are upper or lower case; ignore the case of letters when comparing signatures).
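If you'd rather compute the length and signatures yourself instead of relying on a separate utility, Python's standard hashlib module does the job in a few lines; a sketch (note that hexdigest produces lower-case hex digits, so compare case-insensitively, as advised above):

```python
import hashlib
import os

def file_signatures(path):
    """Return (length in bytes, MD5 hex digest, SHA1 hex digest)
    for the file at path, reading it in chunks so that even
    multi-megabyte downloads don't need to fit in memory."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return os.path.getsize(path), md5.hexdigest(), sha1.hexdigest()
```

Compare all three values against those reported by the verification form; a matching length with mismatched signatures would indicate garbling rather than truncation.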
If the length of the file you downloaded differs from the length on the server as reported on the verification page, then you have received an incomplete or garbled file; there is no reason to suspect the file on the server is incorrect. The problem is almost certainly an error in the downloading process or in the application you used to download the file.
Webmasters who wish to add download verification like this to their own sites may download the source code (a GZIP compressed TAR archive containing a CGI application written in the Perl language—if you don't know what this means, you shouldn't download the file). This program was developed for use at Fourmilab, and will have to be adapted for use at other sites. The program is in the public domain and may be used in any way without any restrictions whatsoever, but it is utterly unsupported: you are entirely on your own.
This document is in the public domain.