ANNOYANCE-FILTER
Section: User Commands (1)
Updated: 4 AUG 2004
NAME
annoyance-filter - automatically detect junk mail
SYNOPSIS
annoyance-filter [options]
DESCRIPTION
annoyance-filter
uses Bayesian statistics to determine the probability an
E-mail message is junk based on an analysis of its contents
compared to collections of known junk and legitimate E-mail.
The current version of this program is always posted at:
http://www.fourmilab.ch/annoyance-filter/
Please visit this page for news about the program and
to download the latest version.
The project is hosted on SourceForge, where you will find
the CVS source code repository and release archives:
http://sourceforge.net/projects/annoyancefilter/
USAGE
annoyance-filter
has a multitude of options which permit it to be used in
many different ways, but the most common application involves
training
the program with collections of legitimate and junk mail
in order to create a
dictionary
which indicates the probability that words identify a
message as junk or non-junk (legitimate). Training must
be done before the program is used to classify incoming
mail, but need be done subsequently only when adding
messages to the training collections. As long as
the overall content of the mail you receive, both junk and
legitimate, remains much the same, there's no need to retrain.
Retraining, however, lets the program adapt to evolving
message content, which is particularly characteristic of junk mail.
Suppose you have a collection of legitimate mail (in other
words, mail you wish to read) in a file named
m-good
and a collection of junk mail (that which you don't wish
to read) in file
m-junk.
These collections may be in “Unix mail
folder” format, which is simply the text of one or more
E-mail messages concatenated together in a single text file, or
may be the names of directories containing files, each of which
may be a single E-mail message or a Unix mail folder. In either
case, if a message file is compressed with
gzip,
it will be automatically uncompressed on the fly. Directories
of messages may not, however, contain other directories of
messages.
To train
annoyance-filter
with these collections and create a dictionary, use a
command like:
annoyance-filter --mail m-good --junk m-junk --prune --write dict.bin
where
dict.bin
is the name of the dictionary file you wish to create.
Now that the dictionary has been created, you can use it on subsequent
runs to compute the probability a message is junk and classify it
accordingly. Suppose you have an E-mail message in the file
mail.txt.
To compute its junk probability and display it on standard output,
use the command:
annoyance-filter --read dict.bin --test mail.txt
To integrate
annoyance-filter
into a mail processing system such as
procmail,
you'll usually want to run it as a
filter
which reads incoming messages from standard input (piped there
by the mail processing system), classifies them and adds annotations
to the message header indicating the classification, then writes the
message with header annotations to standard output. The mail processing
system may then examine the header annotations and route the
message accordingly. To filter a message, again assuming the
dictionary created by the training run is in the file
dict.bin,
use the command:
annoyance-filter --read dict.bin --transcript - --test -
Here the
--transcript
option requests that the annotated message be copied to an
output file, in this case standard output, specified by
“-”.
The message itself is read from standard input, as indicated by the
“-”
argument to the
--test
option.
OPTIONS
Options are specified on the command line. Options are treated as
commands—most instruct the program to perform some specific action;
consequently, the order in which they are specified is
significant; they are processed left to right. Long options
beginning with
“--”
may be abbreviated to any unambiguous
prefix; single-letter options introduced by a single
“-”
without arguments may be aggregated.
- --annotate options
-
Add the annotations requested by the
characters in
options
to the transcript generated
by the
--transcript
option. Upper and lower case
options
are treated identically. Available annotations are:
d Decoder diagnostics
p Parser warnings and error messages
w Most significant words and their probabilities
- --autoprune n
-
As the dictionary is being built by appending
mail to it with the
--mail
and
--junk
options, unique words
will automatically be pruned from it whenever the dictionary
exceeds approximately
n
bytes. This is particularly handy
when loading large collections of messages with
--phrasemax
set greater than one, as a very large number of unique phrases may
clutter the dictionary being built and exceed the memory capacity
of your computer. You could split the mail collection into
multiple parts and explicitly
--prune
after each part, but
--autoprune
is much more convenient.
- --biasmail n
-
The frequency of words appearing in legitimate
mail is inflated by the floating point factor
n,
which defaults
to 2. This biases the classification of messages in favour of
“false negatives”—junk mail deemed legitimate, while
reducing the probability of “false positives” (legitimate
mail erroneously classified as junk, which is
bad).
The higher
the setting of
--biasmail,
the greater the bias in favour of
false negatives will be.
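The effect of the bias can be illustrated with a short Python sketch. It uses the word-probability estimate popularised by Paul Graham's essay "A Plan for Spam", which is assumed here purely for illustration; the formula annoyance-filter actually applies may differ, and the function and parameter names are hypothetical.

```python
def word_junk_probability(mail_count, junk_count, n_mail, n_junk, biasmail=2.0):
    """Estimate the probability that a word indicates junk (illustrative only).

    mail_count, junk_count: occurrences of the word in each collection.
    n_mail, n_junk: number of messages in each collection.
    biasmail: factor by which occurrences in legitimate mail are inflated.
    """
    if mail_count + junk_count == 0:
        raise ValueError("word never seen; no estimate is possible")
    mail_freq = min(1.0, biasmail * mail_count / n_mail)
    junk_freq = min(1.0, junk_count / n_junk)
    p = junk_freq / (mail_freq + junk_freq)
    # Clamp so that no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, p))
```

With equal counts in both collections, a bias of 1 gives a neutral 0.5, while the default bias of 2 pulls the estimate down to about 0.33, making the word count as mild evidence of legitimate mail.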
- --binword n
-
Binary character streams (for example, attachments
of application-specific files, including the executable code of
worm and virus attachments) are scanned and contiguous sequences of
alphanumeric ASCII characters
n
characters or longer are
added to the list of words in the message. The dollar sign
(“$”)
is considered an alphanumeric character for these
purposes, and words may have embedded hyphens and apostrophes, but
may not begin or end with those characters. If
--binword
is set to zero, scanning of binary attachments is disabled entirely.
The default setting is 5 characters.
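The scanning rule described above can be approximated in Python with a regular expression. This is a sketch of the stated rule only, not the program's own scanner; the function name is hypothetical, and the treatment of consecutive hyphens or apostrophes (here, they end a word) is an assumption.

```python
import re

# Runs of alphanumerics or "$", optionally joined by single embedded
# hyphens or apostrophes, so a word never begins or ends with one.
_WORD = re.compile(rb"[A-Za-z0-9$]+(?:[-'][A-Za-z0-9$]+)*")

def binary_words(data: bytes, binword: int = 5):
    """Return ASCII 'words' of at least `binword` characters found in `data`."""
    if binword <= 0:
        return []  # a setting of zero disables scanning entirely
    return [m.group().decode("ascii")
            for m in _WORD.finditer(data)
            if len(m.group()) >= binword]
```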
- --bsdfolder
-
The next
--mail
or
--junk
folder will be
parsed using “classic BSD” rules for identifying the start of
individual messages in the folder. In BSD-style folders, the
text
“From ”
as the leftmost characters of a line always
denotes the start of a new message: any appearance of this text in
any other context is always quoted, often by prefixing a
“>”
character. In the default
Unix
folder syntax,
“From ”
only marks the start of a new message if it
appears following one or more blank lines. Note that you must
specify
--bsdfolder
before each folder to be read with BSD
rules; it is not a modal setting.
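The difference between the two folder syntaxes can be shown with a small Python sketch. It is an illustration of the rules as described here, not the program's parser; the function name is hypothetical, and treating the start of the file as a message boundary is an assumption.

```python
def split_folder(text: str, bsd: bool = False):
    """Split mail-folder text into messages at "From " separator lines."""
    messages, current = [], []
    previous_blank = True  # assumption: start of file counts as a boundary
    for line in text.splitlines(keepends=True):
        # BSD rules: any line starting with "From " begins a message.
        # Default rules: only when the preceding line was blank.
        starts = line.startswith("From ") and (bsd or previous_blank)
        if starts and current:
            messages.append("".join(current))
            current = []
        current.append(line)
        previous_blank = line.strip() == ""
    if current:
        messages.append("".join(current))
    return messages
```

A body line such as "From here on it is text" that directly follows other text splits the message under BSD rules but not under the default rules.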
- --classify fname
-
Classify mail in
fname.
If it
equals or exceeds the junk threshold (see
--threshjunk),
“JUNK”
is written to standard
output and the program exits with status code 3. If the
message scores less than or equal to the mail threshold
(see
--threshmail),
“MAIL”
is written to standard
output and the program exits with status 0. If the
message's score falls between the two thresholds, its
content is deemed indeterminate;
“INDT”
is written to
standard output and the program exits with a status of 4.
The output can be used to set an environment variable in
Procmail
to control the disposition of the message.
If
fname
is
“-”
the message is read from
standard input.
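The decision rule above can be summarised in a few lines of Python. This is a sketch for illustration: the function name is hypothetical, and testing the junk threshold first when a score satisfies both conditions is an assumption based on the order of the description.

```python
def classify(score, threshmail=0.9, threshjunk=0.9):
    """Return (label, exit_status) for a junk probability `score`."""
    if score >= threshjunk:
        return "JUNK", 3
    if score <= threshmail:
        return "MAIL", 0
    return "INDT", 4  # between the thresholds: indeterminate
```

With the default thresholds there is no gap, so "INDT" can occur only when --threshmail is set below --threshjunk.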
- --clearjunk
-
Clear appearances of words in junk mail from database.
Used when preparing a database of legitimate mail.
- --clearmail
-
Clear appearances of words in legitimate mail from database.
Used when preparing a database of junk mail.
- --copyright
-
Print copyright information.
- --csvread fname
-
Import a dictionary from a
comma-separated value (CSV) file
fname.
Records are
assumed to be in the format written by
--csvwrite
but
need not be sorted in any particular order. Words are added
to those already in memory.
- --csvwrite fname
-
Export the dictionary as a
comma-separated value (CSV) file
fname.
Such
files can be loaded into spreadsheet or database programs for
further processing. Words are sorted first in ascending order
of probability they denote junk mail, then lexically.
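The sort order can be expressed as a two-part key, sketched here in Python. The column layout of the CSV file is not specified on this page, so only the ordering is shown; the function name is hypothetical and "lexically" is assumed to mean plain string comparison.

```python
def csv_order(word_probs):
    """Order (word, probability) pairs by probability, then by word."""
    return sorted(word_probs.items(), key=lambda item: (item[1], item[0]))
```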
- --fread, -f fname
-
Load a fast dictionary (previously
created with the
--fwrite
option) from file
fname.
- --fwrite fname
-
Write a dictionary to the file
fname
in fast dictionary format. Fast dictionaries are written in a binary
format which is
not
portable across machines with different
byte order conventions and cannot be added incrementally to assemble
a larger dictionary, but can be loaded in a small fraction of
the time required by the format created by the
--write
command.
Using a fast dictionary for routine classification of incoming
mail drastically reduces the time consumed in loading the
dictionary for each message.
- --help, -u
-
Print how-to-call information including a
list of options.
- --junk, -j fname
-
Add the mail in folder
fname
to the dictionary as junk mail. These folders may be compressed
by a utility the host system can uncompress; specify the complete
file name including the extension denoting its form of compression.
If
fname
is
“-”
the mail folder is read from
standard input.
- --list
-
List the dictionary on standard output.
- --mail, -m fname
-
Add the mail in folder
fname
to the dictionary as legitimate mail. These folders may be compressed
by a utility the host system can uncompress; specify the complete
file name including the extension denoting its form of compression.
If
fname
is
“-”
the mail folder is read from
standard input.
- --newword n
-
The probability that a word seen in mail which
does not appear in the dictionary (or appeared too few times to
assign it a probability with acceptable confidence) is indicative of
junk is set to
n.
The default is 0.2—the odds are that novel
words are more likely to appear in legitimate mail than in junk.
- --pdiag fname
-
Write a diagnostic file to the specified
fname
containing the actual lines the parser processed (after decoding of MIME
parts and exclusion of data deemed unparseable). Use this option when you
suspect problems in decoding or pre-parser filtering.
- --phraselimit n
-
Limit the length of phrases assembled according to the
--phrasemin
and
--phrasemax
options to
n
characters. This
permits ignoring “phrases” consisting of gibberish from mail headers
and un-decoded content. In most cases these items will be discarded by
a
--prune
in any case, but skipping them as they are generated keeps
the dictionary from bloating in the first place. The default value is
48 characters.
- --phrasemin n
-
Calculate probabilities of phrases consisting of
a minimum of
n
words. The default of 1 calculates probabilities for
single words.
- --phrasemax n
-
Calculate probabilities of phrases consisting of
a maximum of
n
words. The default of 1 calculates probabilities for
single words. If you set this too large, the dictionary may grow
to an absurd size.
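How the three phrase options interact can be sketched in Python. This is an illustration under stated assumptions, not the program's tokeniser: the function name is hypothetical, words are assumed to be joined by single spaces, and a phrase exactly at the length limit is assumed to be kept.

```python
def phrases(tokens, phrasemin=1, phrasemax=1, phraselimit=48):
    """Yield phrases of phrasemin..phrasemax consecutive tokens."""
    for start in range(len(tokens)):
        for length in range(phrasemin, phrasemax + 1):
            words = tokens[start:start + length]
            if len(words) < length:
                break  # ran off the end of the token list
            phrase = " ".join(words)
            if len(phrase) <= phraselimit:
                yield phrase
```

For three tokens with --phrasemax 2 this yields five phrases instead of three, which shows why large settings make the dictionary grow quickly.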
- --plot fname
-
After loading the dictionary, create a
plot in
fname.png
of the histogram of words, binned
by their probability of appearance in junk mail. In order to
generate the histogram the
GNUPLOT
and
Netpbm
utilities must be installed on the system; if they are absent,
the
--plot
option will not be available.
- --pop3port n
-
The POP3 proxy server activated by a subsequent
--pop3server
option will listen for connections on port
n.
If
no
--pop3port
is specified, the server will listen on the default
port of 9110. On most systems, you'll have to run the program as
root if you wish the proxy server to listen on a port numbered
1023 or less.
- --pop3server server[:port]
-
Activate a POP3
proxy server which relays requests made on the previously
specified
--pop3port
or the default of 9110 if no port
is specified, to the specified
server,
which may be
given either as an IP address in “dotted quad” notation
such as
10.89.11.131
or a fully-qualified domain name
like
pop.someisp.tld.
The
port
on which the
server
listens for POP3 connections may be specified
after the
server
prefixed by a colon
(“:”); if no
port is specified, the IANA assigned POP3 port 110 will be
used. The POP3 proxy server will pass each message received on
behalf of a requestor through the classifier and return the
annotated transcript to the requestor, who may then filter it
based on the classification appended to the message header. You
must load a dictionary before activating the POP3 proxy server,
and the
--pop3server
option must be the last on the command
line. The server continues to run and service requests until
manually terminated.
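The server[:port] argument format can be illustrated with a short Python parser. This is a sketch of the syntax described above, not the program's own code; the function name is hypothetical.

```python
def parse_pop3_server(spec: str, default_port: int = 110):
    """Split a "server[:port]" argument into (host, port)."""
    host, sep, port_text = spec.partition(":")
    if not host:
        raise ValueError("missing server name in %r" % spec)
    # With no colon, fall back to the IANA assigned POP3 port.
    port = int(port_text) if sep else default_port
    if not 0 < port < 65536:
        raise ValueError("port out of range in %r" % spec)
    return host, port
```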
- --pop3trace
-
Write a trace of POP3 proxy server operations
to standard error. Each trace message (apart from the dump of the
body of multi-line replies to clients) is prefixed with the
label
“POP3: ”.
- --prune
-
After loading the dictionary from
--mail
and
--junk
folders, this option discards words which appear sufficiently
infrequently that their probability cannot be reliably
estimated. One usually
--prunes the dictionary before
using
--write
to save it for subsequent runs.
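Pruning can be pictured as a simple filter over word counts, sketched here in Python. The actual cutoff the program uses is not stated on this page, so the min_occurrences parameter, its default, the function name, and the dictionary layout are all hypothetical.

```python
def prune(dictionary, min_occurrences=5):
    """Drop words seen fewer than `min_occurrences` times in total.

    `dictionary` maps each word to a (mail_count, junk_count) pair.
    """
    return {word: counts for word, counts in dictionary.items()
            if sum(counts) >= min_occurrences}
```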
- --ptrace
-
Include a token-by-token trace in the
--pdiag
output
file. This helps when adjusting the parser's criteria for recognising
tokens. Setting this option without also specifying a
--pdiag
file will have no effect other than perhaps to exercise your fingers
typing it on the command line.
- --read, -r fname
-
Load a dictionary (previously
created with the
--write
option) from file
fname.
- --sigwords n
-
The probability that a message is junk will be computed
based on the individual probabilities of the
n
words with extremal
probabilities; that is, probabilities most indicative of junk or mail. The
default is 15, but there's no obvious optimal setting for this parameter; it
depends in part on the average length of messages you receive.
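The selection of extremal words and their combination into one score can be sketched in Python. The combining rule shown is the naive Bayes formula from Paul Graham's "A Plan for Spam", assumed here for illustration; the program's own computation may differ, and the function name is hypothetical. Word probabilities are assumed to lie strictly between 0 and 1.

```python
from math import prod

def junk_probability(word_probs, sigwords=15):
    """Combine the `sigwords` most extreme word probabilities."""
    # "Extremal" is taken to mean farthest from the neutral value 0.5.
    chosen = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:sigwords]
    junk = prod(chosen)
    mail = prod(1.0 - p for p in chosen)
    return junk / (junk + mail)
```

A handful of strongly indicative words dominates the result, which is why words near 0.5 are ignored once the limit is reached.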
- --sloppyheaders
-
To evade filtering programs, some junk mail is sent with
MIME part headers which violate the standard but which most mail clients
accept anyway. This option causes such messages to be parsed as
such mail clients would, at the cost of standards compliance. If
--sloppyheaders
is used, it should be specified both when
building the dictionary and when testing messages.
- --statistics
-
After loading the dictionary from
--mail
and
--junk
folders, print statistics of the distribution of junk probabilities of
words in the dictionary. The statistics are written to standard output.
- --test, -t fname
-
Test mail in
fname
and
write the estimated probability it is junk to standard output
unless the
--transcript
option is also specified with
standard output
(“-”)
as the destination, in which case
the inclusion of the probability and classification in the
transcript is adjudged sufficient. If the
--verbose
option
is specified, the individual probabilities of the “most
interesting” words in the message will also be output. If
fname
is
“-”
the message is read from standard
input.
- --threshjunk n
-
Set the threshold for classifying a
message as junk to the floating point probability value
n.
The default threshold is 0.9; messages scored
at or above
--threshjunk
are deemed junk.
- --threshmail n
-
Set the threshold for classifying a
message as legitimate mail to the floating point probability value
n.
The default threshold is 0.9, with messages scored
at or below
--threshmail
deemed legitimate. Note that you may
leave a gap between the
--threshmail
and
--threshjunk
values (although it makes no sense to set
--threshmail
higher).
Mail scored between the two thresholds will then be judged
of uncertain status.
- --transcript fname
-
Write an annotated transcript of
the original message to the specified
fname.
If
fname
is
“-”,
the transcript is written to
standard output. Appended to the end of the message header are an
X-Annoyance-Filter-Junk-Probability
header item giving
the computed probability and an
X-Annoyance-Filter-Classification
item which gives the
classification of the message according to
the
--threshmail
and
--threshjunk
settings; the
classification is given as
“Mail”,
“Junk”,
or
“Indeterminate”.
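A downstream tool could read these annotations with Python's standard email package, as in the sketch below. The header names are those given above; the assumption that the probability header holds a plain decimal number, and the function name, are not confirmed by this page.

```python
from email import message_from_string

def read_annotations(message_text: str):
    """Return (probability, classification) from an annotated transcript."""
    msg = message_from_string(message_text)
    prob = msg["X-Annoyance-Filter-Junk-Probability"]
    label = msg["X-Annoyance-Filter-Classification"]
    # Assumption: the probability is written as a plain decimal number.
    probability = float(prob.strip()) if prob is not None else None
    return probability, label.strip() if label is not None else None
```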
- --verbose, -v
-
Print diagnostic information as the program
performs various operations.
- --version
-
Print program version information.
- --write fname
-
Write a dictionary to the file
fname.
The dictionary is written in a binary format which may be
loaded on subsequent runs with the
--read
option. Binary
dictionary files are portable among machines with different
architectures and byte order.
EXIT STATUS
The program exits with a
status of 0 when processing is successfully completed,
1 when an error (I/O or file access in most cases)
occurs, and 2 to indicate a command line syntax error.
If the
--classify
option is specified, an exit status of
0 identifies the message tested as legitimate mail,
3 marks it as junk, and a status of 4 is returned for
messages which cannot be confidently classified as
either mail or junk.
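A calling script might translate the --classify exit status into a delivery decision along the following lines. This is a Python sketch; the folder names and the function name are hypothetical, and only the status codes come from this page.

```python
def route_for_status(status: int) -> str:
    """Map a --classify exit status to a (hypothetical) folder name."""
    if status == 0:
        return "inbox"    # legitimate mail
    if status == 3:
        return "junk"     # classified as junk
    if status == 4:
        return "review"   # indeterminate: let a human decide
    # 1 = I/O or file access error, 2 = command line syntax error
    raise RuntimeError("annoyance-filter failed with status %d" % status)
```

Raising on statuses 1 and 2 keeps a failed run from being mistaken for a classification.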
FILES
Files are read or written as requested by options on the
command line; all options which read or write files
take a
fname
argument which gives the file name. The
--classify,
--junk,
--mail,
--test,
and
--transcript
options interpret an argument of
“-”
as denoting standard input or output.
On systems which
provide the required services and utilities, arguments
to the
--junk
and
--mail
options may be compressed files or the name of a directory
containing one or more messages which will be read as if
logically concatenated. Messages in the directory may be
compressed or uncompressed.
Error messages and diagnostic output generated when
the
--verbose
option is specified are written to standard error.
BUGS
Millions, doubtless. This is a program which must cope with
whatever garbage is fed to it from mail folders, trying to
make the best of it. When it messes up, your efforts in
identifying the message which caused the problem and
submitting a verbatim copy of it with your bug report
are much appreciated.
Please report bugs to
bugs@fourmilab.ch
and include
annoyance-filter
in the Subject line. Thanks in advance.
AUTHOR
John Walker
http://www.fourmilab.ch/
This software is in the public domain. Permission to use, copy,
modify, and distribute this software and its documentation for
any purpose and without fee is hereby granted, without any
conditions or restrictions. This software is provided “as
is” without express or implied warranty.
SEE ALSO
gnuplot(1),
gs(1),
gzip(1),
netpbm(1),
procmail(1),
xpdf(1)
annoyance-filter
is written using the
Literate Programming
http://www.literateprogramming.com/ methodology; the
user manual, program, and internal documentation
are developed together, closely interlinked.
Whenever the program is modified, the documentation is
automatically updated, reducing the risk of divergence
between what the manual says and what the program does.
This
man
page is intended as a reference for the command line
options and most common applications of the program. For
comprehensive documentation, including details of how to
integrate
annoyance-filter
with the
procmail
mail processing system, please refer to the complete
documentation published in PDF format, available on the
Web at:
http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf
If you have downloaded the
annoyance-filter
source distribution, the corresponding version of
annoyance-filter.pdf
is included in the archive. You can read PDF files
with Acrobat reader (a free download from
http://www.adobe.com/acrobat/readstep.html) or
the
xpdf
or Ghostscript
(gs)
utilities.
Index
- NAME
- SYNOPSIS
- DESCRIPTION
- USAGE
- OPTIONS
- EXIT STATUS
- FILES
- BUGS
- AUTHOR
- SEE ALSO
This document was originally created by
man2html,
using the manual pages on
21:03:21 GMT, August 4, 2004. The HTML was
subsequently updated to XHTML 1.0 Strict by John
Walker in March, 2018.