EGGSHELL is a suite of programs which provide access to the
data set assembled by the
Global Consciousness Project,
facilitating development of custom software for exploration and "data mining"
of this vast and rapidly growing resource. Since innumerable questions
can be posed regarding the data set, the programs,
while providing commonly requested analyses, focus on efficient
access to and manipulation of the data, minimising the amount of custom
programming required for exploratory studies. In addition, the efficiency of
optimised native-mode programs permits using them as the basis for
data access tools and presentation software made available to the public
on a Web server. The name "EGGSHELL" denotes programs, mostly run from the
UNIX shell, providing access to the data collected by the worldwide "eggs" of the
Global Consciousness Project
network, as
well as a "wrapper" for the raw egg data files, mediating
access by analysis software and handling details such as exclusion of known-bad
data and calculation of common statistical measures.
All of the programs comprising EGGSHELL are written in the
C++
programming language and use the
Standard
Template Library (STL). In order to use this software, you'll need a recent,
standards-compliant C++ compiler and library; all of these programs were developed
using the GNU C++ compiler (g++) version 2.96 on an Intel Linux system
running kernel 2.4.2-2.
The programs are written in the
Literate Programming
paradigm using
Donald Knuth's
CWEB
programming language. As such, they are meant to be read as well as
run; the programs serve as their own documentation and are intended to be
entertaining and informative as well as efficient. Consequently, this document
contains only an overview of the programs and pointers so you can read
them on your own. CWEB programs automatically emit
ready-to-compile C++ source code and documentation in the
TeX typesetting
language. The tools for extracting code and documentation
from CWEB programs are included in the EGGSHELL
distribution, but you needn't use them nor learn the
CWEB language yourself to employ the toolkit in your
own programs; you're perfectly free to write standard C++ and
link to the extracted C++ code, using the TeX files purely as
a manual.
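For instance, a free-standing analysis program might be as simple as the
following sketch. (The header name eggdata.h is an assumption based on the
names of the .w files; check the extracted files in the distribution for
the exact names.)

#include "eggdata.h"            /* extracted from eggdata.w */

int main()
{
    eggdatabases ed;            /* egg database catalogue, described below */
    ed.set_local_defaults();    /* choose paths for the local host */

    /* ... your own analysis, in standard C++, goes here ... */

    return 0;
}

Compile this with the same C++ compiler used to build the toolkit and link
it with the extracted object files.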
The links from the names of the programs below are to Adobe Acrobat
PDF files produced from the TeX documentation. If your browser has a
PDF plug-in, they will open directly in a new browser window. If your
browser lacks the requisite plug-in, they will invite you to download the PDF
file to a directory on your computer, whence you may read it with
Acrobat Reader, which is a
free download
from Adobe, available for virtually all popular platforms. The PDF files for
the documents are included in the source distribution which you
may download from the bottom of this
page. If you're planning to read the programs with a stand-alone
copy of Acrobat Reader, it's probably easier to get the distribution
complete with all of the PDF files instead of downloading them one-by-one
from the individual links.
Example Programs
- examples
-
Eight example programs are included in the CWEB program examples.w.
(One advantage of CWEB is that a single self-documenting program can
emit as many C/C++ source and header files as necessary while
remaining organised in a logical fashion for the reader, free of the
dictates of the compiler.) The examples illustrate various analyses
using the toolkit facilities, beginning with simple cases which introduce
the toolkit and progressing to more complicated calculations which explore
its various data extraction and reduction components. Reading this
program is the best way to "get into" the toolkit, exploring
the other programs it uses as you encounter them.
Toolkit Component Programs
The toolkit facilities used by the example programs are
contained in the following programs, discussed in more or less
decreasing order of abstraction from the raw data. The
heart of the toolkit comprises the
analysis and
eggdata programs, while the remaining
programs provide utilities which can be used in isolation
or in conjunction with the egg database access facilities.
- analysis
-
This program implements higher level analyses of
egg data sets, either the original one-day summaries or arbitrary time spans
extracted from the collection of daily summaries.
- eggdata
-
Provides tools for reading both egg data tables and
auxiliary databases, such as the properties of individual eggs, known
bad data which should be excluded from analyses, etc., with facilities
for extracting and assembling data sets for analysis.
- statlib
-
General-purpose statistical library, which includes both tools
for computing various probability distributions and descriptive
statistics on data tables with a user-defined type (illustrated
in a sketch after this list).
- timedate
-
Facilities for working with times and dates,
using UNIX time_t quantities as the underlying type. Conversions to
and from string representations and Julian day numbers are provided,
along with computation of astronomical quantities such as sidereal
time, the phase of the Moon, and the positions of the Sun and Moon
(the time_t-to-Julian-day arithmetic is sketched after this list).
- fourier
-
Tools for computing Fourier and inverse
Fourier transforms of data sets and determining power spectra from
frequency-domain information (see the power-spectrum sketch after
this list). Note: I am not an expert in this
domain, and anybody using this code would be well-advised to look
closely at its implementation and test it with known data
before relying on its results.
- colour
-
Tools for manipulating colours and colour spaces, both
physical and perceptual. This program isn't currently used by any of the
examples; it was implemented for eventual use in stand-alone graphics
generation (eliminating the need for GNUPLOT post-processing), but
may prove useful in "artistic" presentations of the data set.
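To give a concrete flavour of what "descriptive statistics on data tables
with a user-defined type" means, here is a toy version, entirely independent
of statlib's actual interface (the record type and function names below are
invented for the illustration; consult statlib.w for the real ones):

#include <vector>

struct egg_sample {             /* hypothetical record type */
    int egg_number;
    double value;
};

/* Mean of the value field over a table of samples */
double mean(const std::vector<egg_sample> &t)
{
    double sum = 0;
    for (unsigned int i = 0; i < t.size(); i++) {
        sum += t[i].value;
    }
    return sum / t.size();
}

/* Unbiased sample variance of the value field */
double variance(const std::vector<egg_sample> &t)
{
    double m = mean(t), sum = 0;
    for (unsigned int i = 0; i < t.size(); i++) {
        double d = t[i].value - m;
        sum += d * d;
    }
    return sum / (t.size() - 1);
}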
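The timedate facilities rest on a simple correspondence which is worth
knowing when checking results: the UNIX epoch (1970-01-01 00:00 UTC) falls
at Julian day 2440587.5, and a Julian day is 86400 seconds long. The
following fragment illustrates the arithmetic directly, without using the
toolkit's own API:

#include <time.h>
#include <stdio.h>

/* Julian day corresponding to a UNIX time_t */
double julian_day(time_t t)
{
    return (t / 86400.0) + 2440587.5;
}

int main()
{
    time_t now = time(0);
    printf("JD %.5f\n", julian_day(now));
    return 0;
}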
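Finally, for readers who take the fourier program's advice to test it with
known data: the power spectrum is just the squared magnitude of each
complex frequency-domain coefficient, and a brute-force discrete Fourier
transform, slow but trivially verifiable, makes a handy reference. This
sketch is independent of the toolkit's implementation:

#include <vector>
#include <complex>
#include <math.h>

/* Brute-force O(n^2) DFT, for checking a fast transform against */
std::vector< std::complex<double> > dft(const std::vector<double> &x)
{
    unsigned int n = x.size();
    std::vector< std::complex<double> > X(n);
    for (unsigned int k = 0; k < n; k++) {
        for (unsigned int j = 0; j < n; j++) {
            double a = -2.0 * M_PI * k * j / n;
            X[k] += x[j] * std::complex<double>(cos(a), sin(a));
        }
    }
    return X;
}

/* Power in bin k: squared magnitude of the k-th coefficient */
std::vector<double> power_spectrum(const std::vector< std::complex<double> > &X)
{
    std::vector<double> p(X.size());
    for (unsigned int k = 0; k < X.size(); k++) {
        p[k] = std::norm(X[k]);     /* re^2 + im^2 */
    }
    return p;
}

Feeding a pure sinusoid of known frequency through this reference and
through the toolkit's transform is an easy way to compare the two.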
The toolkit is supplied as a GZIPped TAR archive which extracts into the current
directory. The
CWEB programs are supplied as
.w files, with the
C++ (.c and .h) and TeX (.tex) files already extracted.
Pre-generated PDF files for the documents are also provided.
To build the toolkit and example programs, you'll need a current C++
compiler and library (I used GCC/G++ 2.96 to develop them). After extracting
the archive, build it with:
./configure
make
If all goes well, when this process is complete, you'll have compiled
object files for all of the toolkit components and ready-to-run
executables for the example programs,
example-1
through
example-8. To run the examples, you'll need to
have a copy of the Global Consciousness Project "eggsummary"
and pseudorandom mirror files on your machine (or at least the
ones used in the examples). The
configure process
automatically detects the locations of these files for the
www.fourmilab.ch and
noosphere.princeton.edu
sites, but for other sites you'll need to add definitions of the
database locations to the
eggdatabases class definition
in
eggdata.w (see the
set_local_defaults method
and those it calls).
(Unfortunately, the current
noosphere.princeton.edu
server has neither g++ nor TeX installed, so it isn't
possible to build or use these programs there.)
When you modify a
.w file,
the
Makefile automatically rebuilds the C++ programs
and TeX documents it defines; the
CWEB tools which accomplish this are
included in the distribution.
If you have a complete TeX distribution loaded, you can rebuild the
document for a program prog and view it in the TeX previewer
with the command:
make prog.view
and update the PDF documents with:
make doc
Configuring Local Database Paths
If you're installing the analysis toolkit on a machine onto which
you've copied all or part of the CSV format "eggsummary" files, you'll
need to configure the directory name in which the files are kept.
To permit the software to be installed on various analysts'
machines, all of the examples initialise their
eggdatabases
object by calling its
set_local_defaults()
method.
The "
./configure" script uses the
"
hostname" utility to obtain the
name of the machine it's running on, and embeds this in the
Makefile and configuration as a definition of the C macro
HOSTNAME. When this macro is defined, the
set_local_defaults() method,
defined in
eggdata.w, tests it against known hosts and sets the
path names appropriately. To define a new host, add a new
set_hostname_defaults() method to the eggdatabases class definition in
eggdata.w, using the existing
set_Fourmilab_defaults() and
set_noosphere_defaults() as a model, then add a case for the
hostname to the
set_local_defaults() method immediately
below it in the file. After you've tested the definitions for
your host, please send me a copy of the code you added so I can
incorporate it into
eggdata.w in the next release. That
way you won't have to keep modifying the file every time a new
release is posted.
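As a purely hypothetical illustration (the host name and paths below are
invented, and the add_database() calls follow the manual-initialisation
example in the next paragraph; model your actual code on the methods in
eggdata.w), the addition for a machine named "lab1" might resemble:

/* Hypothetical defaults for a host named "lab1" */
void eggdatabases::set_lab1_defaults(void)
{
    add_database("gcp", "/home/lab1/gcp/eggsummary");
    add_database("pseudo", "/home/lab1/gcp/pseudoeggsummary");
}

together with a corresponding test for "lab1" in set_local_defaults()
immediately below it.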
If you don't want to add definitions for your host to eggdata.w,
you can manually initialise the eggdatabases object with
code like the following:
eggdatabases ed;
ed.add_database("gcp", "/home/httpd/html/data/eggsummary");
ed.add_database("pseudo", "/home/httpd/html/data/pseudoeggsummary");
where the call with the argument of
"gcp" specifies the path
name of the directory containing the eggsummary files for the data
taken by the egg network, and the call with
"pseudo" the
path for the pseudorandom mirror data generated by the GCP host.
If you're only using one of the databases in your analyses, you
needn't provide a path for the other. The eggsummary files in the
directories may be compressed with GZIP.
Maintaining Data Integrity
The analysis software relies on two Comma-Separated Value (CSV) databases supplied
with it which identify known bad data in the data set and specify physical
properties of "egg" hosts in the network. These files were current as of the
date the archive was posted, but it's up to you to verify that they're correct
if you're analysing data collected subsequently. The files are as follows:
- eggs.csv
-
Properties of "egg" sites in the Global Consciousness Project network. This is a
machine-readable encoding of the
table
posted on the GCP site.
- rotten_egg.csv
-
Known bad data in the GCP database, identified by egg number and the time range
during which the data were unreliable.
The source document
is the table of errors on the GCP Web site; you should confirm that no
additions have been made to this table before performing definitive
analyses. (Note that data collected in the past may be subsequently
discovered to be erroneous, so even data taken prior to the date of
the rotten_egg.csv file in the distribution may be retracted
if discovered bad.) Bad data are never deleted from the data set; leaving
them in place preserves its integrity and visibility. It is better to
require analysts to exclude bad data explicitly than to risk allegations of
"cooking the books" by removing data arbitrarily.