Saturday, March 11, 2006
Scraping ArXiv: Kludging Open Scientific Hypertext
Ever since the creation of the arXiv.org e-Print Archive (formerly the Preprint Archive at Los Alamos National Laboratory), occasional controversies have erupted over the credentials required to post papers there and the suitability of certain topics in various categories. In one celebrated case, Nobel Laureate Brian Josephson (Physics, 1973) struggled to publish an (admittedly controversial) paper on ArXiv, only to have it moved from the category in which it was originally accepted to a “general physics” category without consulting the author. Such controversies have led to the establishment of an Archive Freedom site to document alleged misconduct by ArXiv management, which links to an “Open Archive” site called sciprint.org which, at present, appears to be under construction. Recently, a kerfuffle has broken out over ArXiv's introduction of “trackbacks” to discussions of papers on Web logs. Some critics of string theory claim they have been deliberately excluded from posting trackbacks purely because of their skeptical view of that enterprise. (Following the links in this article will take you to more discussion of the issue, pro and con, which you'll probably want to read.) Despite their reputation for computer savvy, their pioneering of rapid-turnaround paperless publication, and their rapid adoption of new technologies, it seems to me that the physicists are approaching this in a surprisingly old-media kind of way: they have replaced the opaque peer review process of print journals with opaque guardians at the gate of their electronic successors. Now, if the operators of ArXiv wish to enforce their own criteria (and, indeed, the overwhelming majority of taxpayer-funded scientists cranking out minimum publishable units on the trudge toward tenure seem happy with their judgement, or at least disinclined to complain), why not let them, and build something better on top of their foundation: “embrace and extend”, as it were? 
The concept of a single monolithic archive is so twentieth century, anyway; all the technology needed to transcend the limitations of the existing archive is at hand and readily deployed. In fact, it is almost identical to that used by news and other feed aggregators. The idea of an open, alternative archive sounds fine and noble, but nobody will read the papers there if they are all the flaky stuff such an archive is certain to accrete in abundance. It seems to me that the correct approach is to do as programmers do, and encapsulate ArXiv with a “wrapper” which addresses its shortcomings. This could be done just as news aggregators “scrape” sites to build a feed from their content. Suppose someone were to set up their own “anarXiv.org” site (hurry up—the name's still available!), which mirrored all of the content of ArXiv at the index level: only the information used in searches and queries would be mirrored; the actual content, even the abstracts, would be retrieved directly from ArXiv through redirection. This would avoid any issue of violation of authors' copyright, since all that would be served is a link to where they posted the paper for public access. Since content would be fetched from ArXiv, updates to papers would be available as soon as they were posted there, so there would be no problem with mirror synchronisation. The anarXiv.org site would, however, accept submissions with its own, presumably more open policy. (What policy? That's up to whoever built it, and if others disagree with them, let them start their own archives layered atop it. All that's being replicated are links, and the cost of that is next to nil.) 
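The “wrapper” idea can be sketched in a few lines. This is purely illustrative: the ID prefix convention, the `LOCAL_PAPERS` table, and the anarxiv.example.org URL are hypothetical, invented for this sketch; only the arxiv.org/abs URL pattern is real.

```python
# Sketch of index-level mirroring with redirection: anarXiv holds only
# index metadata for arXiv papers, plus full content for its own
# submissions.  A content request for an arXiv ID is answered with a
# redirect to arXiv itself, so no text is copied and updates on arXiv
# appear immediately.

# Hypothetical table of papers submitted directly to the open archive.
LOCAL_PAPERS = {
    "gr-qc/a0603888": "https://anarxiv.example.org/papers/gr-qc/a0603888",
}

def resolve(paper_id: str) -> str:
    """Return the URL where the content for paper_id actually lives."""
    if paper_id in LOCAL_PAPERS:
        # A native submission: serve it from the open archive's own store.
        return LOCAL_PAPERS[paper_id]
    # Otherwise assume an arXiv ID and redirect to its abstract page,
    # the one place the author actually posted the paper.
    return f"https://arxiv.org/abs/{paper_id}"
```

A front-end HTTP server would simply issue a 302 redirect to whatever `resolve()` returns, which is why mirror synchronisation never arises: the mirror holds no content to get out of date.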
Papers on AnarXiv would have URLs similar to, but in a distinct name space from, those on ArXiv: for example, papers in the general relativity and quantum cosmology section of ArXiv have names like gr-qc/0603999, while a paper in the same section at AnarXiv might be named gr-qc/a0603888. But ArXiv links would work at AnarXiv as well, so as long as you retrieved papers through AnarXiv, you needn't worry where they're coming from. AnarXiv could have its own innovative trackback, comment, filtering, and moderation facilities, limited only by the imagination of those who created it, and these facilities would apply to papers published on ArXiv as well, since any ArXiv publication would transparently “show through”. Now, I am not foolish enough to think for an attosecond that creating this kind of open archive on top of a scholarly resource would not open the floodgates to trash from every crank, crackpot, and illiterate bozo on the Web (heck, I used to read sci.physics on Usenet!). What it would do, if done right, is create a transparent testbed for experimenting with all the various kinds of moderation and filtering which have been discussed ever since Ted Nelson proposed Xanadu. AnarXiv could support third-party plug-ins, much like those in modern Web browsers like Mozilla Firefox, which implemented various filtering strategies (for example, “transitive endorsement”: show me papers endorsed [or, perhaps, simply retrieved in full] by people I trust and the web of those they trust). If there's actually something to the dream of open hypertext, then how about really trying it out and seeing if it can be made to work, instead of fighting over a publication model that dates from the age of print on paper? Please note that I have no interest whatsoever in building such a system myself, nor in participating in such a project beyond making this suggestion. 
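The “transitive endorsement” filter mentioned above amounts to a bounded walk over a web of trust. Here is a minimal sketch under invented data: the `TRUSTS` and `ENDORSES` tables, the names, and the depth cutoff are all hypothetical, and a real plug-in would pull such relations from the archive's own records.

```python
from collections import deque

# Hypothetical trust web: who trusts whom, and which papers each
# person has endorsed.
TRUSTS = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
}
ENDORSES = {
    "alice": ["gr-qc/0603999"],
    "bob": ["gr-qc/a0603888"],
    "carol": ["hep-th/0601001"],
}

def endorsed_papers(reader, max_depth=2):
    """Papers endorsed by the reader's trust web, out to max_depth hops.

    Breadth-first walk: the reader's own endorsements count as depth 0,
    people they trust as depth 1, and so on.
    """
    seen = {reader}
    papers = set()
    queue = deque([(reader, 0)])
    while queue:
        person, depth = queue.popleft()
        papers.update(ENDORSES.get(person, []))
        if depth < max_depth:
            for other in TRUSTS.get(person, []):
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
    return papers
```

The depth bound is the interesting knob: at depth 0 you see only your own endorsements, and each additional hop widens the web, which is exactly the kind of policy such plug-ins would let readers set for themselves.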
More than six decades have elapsed since Vannevar Bush described Memex in As We May Think, and four decades since Ted Nelson coined the word “hypertext” in describing Project Xanadu and Doug Engelbart advocated Augmenting Human Intellect with computers. Now that the technological prerequisites of such systems are on the desks of scientists around the world, shouldn't we be getting on with it?

Posted at March 11, 2006 14:49