bogofilter(1)


      bogofilter - fast Bayesian spam filter
      


SYNOPSIS

      bogofilter [help options | classification options | registration options | parameter options | info options]
                 [general options] [config file options]
 
      where
 
      help options are:
 
      [-h] [--help] [-V] [-Q]
 
      classification options are:
 
      [-p] [-e] [-t] [-T] [-u] [-H] [-M] [-b] [-B object ...] [-R] [general options] [parameter options]
       [config file options]
 
      registration options are:
 
      [-s | -n] [-S | -N] [general options]
 
      general options are:
 
      [-c filename] [-C] [-d dir] [-k cachesize] [-l] [-L tag] [-I filename] [-O filename]
 
      parameter options are:
 
      [-E value[,value]] [-m value[,value][,value]] [-o value[,value]]
 
      info options are:
 
      [-v] [-y date] [-D] [-x flags]
 
      config file options are:
 
      [--option=value]
 
      Note: Use bogofilter --help to display the complete list of options.

DESCRIPTION

      Bogofilter is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on
      standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code
      indicating whether or not the message is spam. Bogofilter uses a fast algorithm, relies on the Berkeley DB for
      fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used in production by
      sites that process a lot of mail.

THEORY OF OPERATION

      Bogofilter treats its input as a bag of tokens. Each token is checked against a wordlist, which maintains counts
      of the numbers of times it has occurred in non-spam and spam mails. These numbers are used to compute an estimate
      of the probability that a message in which the token occurs is spam. These per-token estimates are combined to
      indicate whether the message is spam or ham.
 
      While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely
      effective. Paul Graham's paper [1] A Plan For Spam is recommended reading.
 
      This program substantially improves on Paul's proposal by doing smarter lexical analysis. Bogofilter does proper
      MIME decoding and reasonable HTML parsing. Special kinds of tokens like hostnames and IP addresses are retained
      as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are
      ignored so as not to bloat the wordlist. Tokens found in various header fields are marked appropriately.
 
      Another improvement is that this program offers Gary Robinson's suggested modifications to the calculations (see
      the parameters robx and robs below). These modifications are described in Robinson's paper [2] Spam Detection.
 
      Since then, Robinson (see his Linux Journal article [3] A Statistical Approach to the Spam Problem) and others
      have realized that the calculation can be further optimized using Fisher's method. [4] Another improvement
      compensates for token redundancy by applying separate effective size factors (ESF) to spam and nonspam
      probability calculations.
 
      In short, this is how it works: The estimates for the spam probabilities of the individual tokens are combined
      using the "inverse chi-square function". Its value indicates how badly the null hypothesis that the message is
      just a random collection of independent words with probabilities given by our previous estimates fails. This
      function is very sensitive to small probabilities (hammish words), but not to high probabilities (spammish
      words); so the value only indicates strong hammish signs in a message. Now using inverse probabilities for the
      tokens, the same computation is done again, giving an indicator that a message looks strongly spammish. Finally,
      those two indicators are subtracted (and scaled into a 0-1-interval). This combined indicator (bogosity) is close
      to 0 if the signs for a hammish message are stronger than for a spammish message and close to 1 if the situation
      is the other way round. If signs for both are equally strong, the value will be near 0.5. Since such messages
      don't give a clear indication, bogofilter has a tristate mode that marks them as unsure, while the clear
      messages are marked as spam or ham, respectively. In two-state mode, every message is marked as either spam
      or ham.
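 
      The combination step described above can be sketched in Python. This is a minimal illustration, not
      bogofilter's C code: the function names chi2q and bogosity are hypothetical, the formula follows the form
      commonly given for Robinson's chi-square combining, and the ESF factors are left out.

```python
import math

def chi2q(x2, dof):
    """Upper-tail probability of a chi-square variable; dof must be even."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)  # underflows to 0.0 for very large m, acceptable in a sketch
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def bogosity(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Clamp so that log() stays finite (an assumption of this sketch).
    probs = [min(max(p, 1e-9), 1.0 - 1e-9) for p in token_probs]
    n = len(probs)
    if n == 0:
        return 0.5
    # Sensitive to small probabilities: strong when hammish tokens dominate.
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Same computation on the inverse probabilities: strong for spammish tokens.
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Subtract the two indicators and scale into the 0-1 interval.
    return (1.0 + spam_ind - ham_ind) / 2.0
```

      With ten tokens at 0.01 the score is close to 0, with ten tokens at 0.99 it is close to 1, and an even mix of
      both gives a value near 0.5.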
 
      Various parameters influence these calculations. The most important are:
 
      robx: the score given to a token which has not been seen before. robx is the probability that the token is
      spammish.
 
      robs: a weight on robx which moves the probability of a rarely seen token towards robx.
 
      min-dev: a minimum distance from .5 for tokens to use in the calculation. Only tokens farther away from 0.5 than
      this value are used.
 
      spam-cutoff: messages with scores greater than or equal to this value will be marked as spam.
 
      ham-cutoff: If zero or equal to spam-cutoff, all messages with values strictly below spam-cutoff are marked as
      ham and all others as spam (two-state). Otherwise, messages with values less than or equal to ham-cutoff are
      marked as ham, messages with values strictly between ham-cutoff and spam-cutoff are marked as unsure, and the
      rest as spam (tristate).
 
      sp-esf: the effective size factor (ESF) for spam.
 
      ns-esf: the ESF for nonspam. These ESF values default to 1.0, which is the same as not using ESF in the
      calculation. Values suitable to a user's email population can be determined with the aid of the bogotune program.
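 
      How robx, robs, min-dev and the two cutoffs interact can be sketched as follows. The function and argument
      names are hypothetical, the token formula is the f(w) from Robinson's article, and the numeric values in the
      usage note are illustrative rather than bogofilter's defaults.

```python
def token_score(spam_count, ham_count, spam_msgs, ham_msgs, robx, robs):
    """Robinson's f(w): pull the raw estimate towards robx when data is scarce."""
    n = spam_count + ham_count
    if n == 0:
        return robx  # token never seen before
    spam_freq = spam_count / max(spam_msgs, 1)
    ham_freq = ham_count / max(ham_msgs, 1)
    p = spam_freq / (spam_freq + ham_freq)
    return (robs * robx + n * p) / (robs + n)

def is_significant(fw, min_dev):
    """Only tokens farther away from 0.5 than min_dev take part in the calculation."""
    return abs(fw - 0.5) > min_dev

def classify(score, spam_cutoff, ham_cutoff):
    """Apply the two-state or tristate rule described for ham-cutoff."""
    if ham_cutoff == 0 or ham_cutoff == spam_cutoff:
        return "Ham" if score < spam_cutoff else "Spam"  # two-state
    if score <= ham_cutoff:
        return "Ham"
    if score < spam_cutoff:
        return "Unsure"
    return "Spam"
```

      For example, classify(0.5, 0.9, 0.1) returns "Unsure", while classify(0.5, 0.9, 0) returns "Ham" because a
      ham-cutoff of zero selects two-state mode.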

OPTIONS

      HELP OPTIONS
 
      The -h option prints the help message and exits.
 
      The -V option prints the version number and exits.
 
      The -Q (query) option prints bogofilter's configuration, i.e. registration parameters, parsing options,
      bogofilter directory, etc.
 
      CLASSIFICATION OPTIONS
 
      The -p (passthrough) option outputs the message with an X-Bogosity line at the end of the message header. This
      requires keeping the entire message in memory when it's read from stdin (or from a pipe or socket). If the
      message is read from a file that can be rewound, bogofilter will read it a second time.
 
      The -e (embed) option tells bogofilter to exit with code 0 if the message can be classified, i.e. if there is not
      an error. Normally bogofilter uses different codes for spam, ham, and unsure classifications, but this simplifies
      using bogofilter with procmail or maildrop.
 
      The -t (terse) option tells bogofilter to print an abbreviated spamicity message containing 1 letter and the
      score. Spam is indicated with "Y", ham by "N", and unsure by "U". Note: the formatting can be customized using
      the config file.
 
      The -T option provides an invariant terse mode for scripts to use. Bogofilter will print an abbreviated
      spamicity message containing 1 letter and the score. Spam is indicated with "S", ham by "H", and unsure by "U".
 
      The -TT option provides an invariant terse mode for scripts to use. Bogofilter prints only the score and
      displays it to 16 significant digits.
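 
      A script consuming -T output might parse it as in the sketch below. It assumes the letter and the score are
      separated by whitespace on one line (for example "S 0.998918"); check the exact layout against your bogofilter
      version. The names parse_terse and TERSE_LABELS are hypothetical.

```python
TERSE_LABELS = {"S": "spam", "H": "ham", "U": "unsure"}

def parse_terse(line):
    """Split one line of -T output into a label and a numeric score."""
    letter, score = line.split()  # assumed layout: "<letter> <score>"
    return TERSE_LABELS[letter], float(score)
```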
 
      The -u option tells bogofilter to register the message's text after classifying it as spam or non-spam. A spam
      message will be registered on the spamlist and a non-spam message on the goodlist. If the classification is
      "unsure", the message will not be registered. Effectively this option runs bogofilter with the -s or -n flag, as
      appropriate. Caution is urged in the use of this capability, as any classification errors bogofilter may make
      will be preserved and will accumulate until manually corrected with the -Sn and -Ns option combinations. Note
      this option causes the database to be opened for write access, which can entail massive slowdowns through lock
      contention and synchronous I/O operations.
 
      The -H option tells bogofilter to not tag tokens from the header. This option is for testing; you should not use
      it in normal operation.
 
      The -M option tells bogofilter to process its input as a mbox formatted file. If the -v or -t option is also
      given, a spamicity line will be printed for each message.
 
      The -b (streaming bulk mode) option tells bogofilter to classify multiple objects whose names are read from
      stdin. If the -v or -t option is also given, bogofilter will print a line giving file name and classification
      information for each file. This is an alternative to -B which lists objects on the command line.
 
      An object in this context shall be a maildir (autodetected), or if it's not a maildir, a single mail unless -M is
      given - in that case it's processed as mbox. (The Content-Length: header is not taken into account currently.)
 
      When reading mbox format, bogofilter relies on the empty line after a mail. If needed, formail -es will ensure
      this is the case.
 
      The -B object ...  (bulk mode) option tells bogofilter to classify multiple objects named on the command line.
      The objects may be filenames (for single messages), mailboxes (files with multiple messages), or directories (of
      maildir and MH format). If the -v or -t option is also given, bogofilter will print a line giving file name and
      classification information for each file. This is an alternative to -b which lists objects on stdin.
 
      The -R option tells bogofilter to output an R data frame in text form on the standard output. See the section on
      integration with R, below, for further detail.
 
      REGISTRATION OPTIONS
 
      The -s option tells bogofilter to register the text presented as spam. The database is created if absent.
 
      The -n option tells bogofilter to register the text presented as non-spam.
 
      Bogofilter doesn't detect if a message is registered twice. If you do this by accident, the token counts will be
      off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of
      tokens and messages in the wordlist, this doesn't matter. The problem can be corrected by using the -S option or
      the -N option.
 
      The -S option tells bogofilter to undo a prior registration of the same message as spam. If a message was
      incorrectly entered as spam by -s or -u and you want to remove it and enter it as non-spam, use -Sn. If -S is
      used for a message that wasn't registered as spam, the counts will still be decremented.
 
      The -N option tells bogofilter to undo a prior registration of the same message as non-spam. If a message was
      incorrectly entered as non-spam by -n or -u and you want to remove it and enter it as spam, then use -Ns. If -N
      is used for a message that wasn't registered as non-spam, the counts will still be decremented.
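 
      The effect of the registration options on the token counts can be pictured with a toy model. This is only an
      illustration: ToyWordlist is a hypothetical class, counting each distinct token once per message and clamping
      counts at zero are assumptions of the sketch, and the real wordlist is a database file.

```python
from collections import Counter

class ToyWordlist:
    """In-memory stand-in for the spam and non-spam token counts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def register(self, tokens, kind):
        """Models -s (kind="spam") and -n (kind="ham")."""
        for token in set(tokens):
            self.counts[kind][token] += 1

    def unregister(self, tokens, kind):
        """Models -S and -N; decrements even if the message was never registered."""
        for token in set(tokens):
            self.counts[kind][token] = max(self.counts[kind][token] - 1, 0)

wordlist = ToyWordlist()
message = ["cheap", "pills"]
wordlist.register(message, "spam")    # -s
wordlist.register(message, "spam")    # -s again by accident: counts are off by 1
wordlist.unregister(message, "spam")  # -S undoes the duplicate
```

      Moving a message from spam to non-spam, as -Sn does, corresponds to one unregister call with "spam" followed
      by one register call with "ham".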
 
      GENERAL OPTIONS
 
      The -c filename option tells bogofilter to read the config file named.
 
      The -C option prevents bogofilter from reading configuration files.
 
      The -d dir option allows you to set the directory for the database. See the ENVIRONMENT section for other
      directory setting options.
 
      The -k cachesize option sets the cache size for the BerkeleyDB subsystem, in units of 1 MiB (1,048,576 bytes).
      Properly sizing the cache improves bogofilter's performance. The recommended size is one third of the size of the
      database file. You can run the bogotune script (in the tuning directory) to determine the recommended size.
 
      The -l option writes an informational line to the system log each time bogofilter is run. The information logged
      depends on how bogofilter is run.
 
      The -L tag option configures a tag which can be included in the information being logged by the -l option, but it
      requires a custom format that includes the %l string for now. This option implies -l.
 
      The -I filename option tells bogofilter to read its input from the specified file, rather than from stdin.
 
      The -O filename option tells bogofilter where to write its output in passthrough mode. Note that this only works
      when -p is explicitly given.
 
      PARAMETER OPTIONS
 
      The -E value[,value] option allows setting the sp-esf value and the ns-esf value. With two values, both sp-esf
      and ns-esf are set. If only one value is given, parameters are set as described in the note below.
 
      The -m value[,value][,value] option allows setting the min-dev value and, optionally, the robs and robx values.
      With three values, min-dev, robs, and robx are all set. If fewer values are given, parameters are set as
      described in the note below.
 
      The -o value[,value] option allows setting the spam-cutoff and ham-cutoff values. With two values, both spam-cutoff
      and ham-cutoff are set. If only one value is given, parameters are set as described in the note below.
 
      Note: All of these options allow fewer values to be provided. Values can be skipped by using just the comma
      delimiter, in which case the corresponding parameter(s) won't be changed. If only the first value is provided,
      then only the first parameter is set. Trailing values can be skipped, in which case the corresponding parameters
      won't be changed. Within the parameter list, spaces are not allowed after commas.
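 
      The skipping rules in the note can be sketched as a small parser. The function apply_values and the parameter
      dictionaries are hypothetical; the sketch only mirrors the rules stated above and is not bogofilter's own
      option handling.

```python
def apply_values(arg, names, current):
    """Return a copy of current with the values from a "v1,v2,..." list applied."""
    fields = arg.split(",")
    if len(fields) > len(names):
        raise ValueError("too many values")
    updated = dict(current)
    for name, field in zip(names, fields):
        if field == "":
            continue  # skipped value: leave this parameter unchanged
        if field != field.strip():
            raise ValueError("spaces are not allowed in the list")
        updated[name] = float(field)
    return updated  # trailing parameters without a value keep their old setting
```

      For instance, applying ",0.2" to the names ["spam-cutoff", "ham-cutoff"] changes only ham-cutoff, and applying
      "0.4" to ["min-dev", "robs", "robx"] changes only min-dev.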
 
      INFO OPTIONS
 
      The -v option produces a report to standard output on bogofilter's analysis of the input. Each additional v will
      increase the verbosity of the output, up to a maximum of 4. With -vv, the report lists the tokens with highest
      deviation from a mean of 0.5 association with spam.
 
      Option -y date can be used to override the current date when timestamping tokens. A value of zero (0) turns off
      timestamping.
 
      The -D option redirects debug output to stdout.
 
      The -x flags option allows setting of debug flags for printing debug information. See header file debug.h for the
      list of usable flags.
 
      CONFIG FILE OPTIONS
 
      Using GNU longopt -- syntax, a config file's name=value statement becomes a command line's --option=value. Use
      command bogofilter --help for a list of options and see bogofilter.cf.example for more info on them. For example,
      to change the X-Bogosity header to "X-Spam-Header", use:
 
      --spam-header-name=X-Spam-Header

ENVIRONMENT

      Bogofilter uses a database directory, which can be set in the config file. If not set there, bogofilter will use
      the value of BOGOFILTER_DIR. Both can be overridden by the -d dir option. If none of that is available,
      bogofilter will use directory $HOME/.bogofilter.
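 
      The lookup order described above can be written out as a short sketch. The function name and its arguments are
      hypothetical; the order is taken from the paragraph above and nothing beyond it is implied about bogofilter's
      internals.

```python
import os

def resolve_bogofilter_dir(cli_dir=None, config_dir=None, env=None):
    """Pick the database directory: -d, then config file, then environment."""
    env = os.environ if env is None else env
    if cli_dir:                          # -d dir overrides everything else
        return cli_dir
    if config_dir:                       # value set in the config file
        return config_dir
    if env.get("BOGOFILTER_DIR"):        # environment variable
        return env["BOGOFILTER_DIR"]
    return os.path.join(env["HOME"], ".bogofilter")  # final fallback
```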

CONFIGURATION

      The bogofilter command line allows setting of many options that determine how bogofilter operates. File
      /etc/bogofilter.cf can be used to set additional parameters that affect its operation. File
      /etc/bogofilter.cf.example has samples of all of the parameters. Status and logging messages can be customized
      for each site.

RETURN VALUES

      0 for spam; 1 for non-spam; 2 for unsure; 3 for I/O or other errors.
 
      If both -p and -e are used, the return values are: 0 for spam or non-spam; 3 for I/O or other errors.
 
      Error 3 usually means that the wordlist file bogofilter wants to read at startup is missing or the hard disk has
      filled up in -p mode.
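 
      A calling script can turn these exit codes into labels, as in the sketch below. The helper names are
      hypothetical; the only bogofilter option used is -I filename from GENERAL OPTIONS, and the mapping does not
      apply when -e is given.

```python
import subprocess

EXIT_LABELS = {0: "spam", 1: "non-spam", 2: "unsure"}

def label_for_exit(code):
    """Translate a bogofilter exit code (without -e) into a label."""
    if code in EXIT_LABELS:
        return EXIT_LABELS[code]
    # 3 is documented as I/O or other errors; anything else is unexpected.
    raise RuntimeError(f"bogofilter failed with exit code {code}")

def classify_file(path):
    """Run bogofilter on one message file and return its label."""
    result = subprocess.run(["bogofilter", "-I", path])
    return label_for_exit(result.returncode)
```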

INTEGRATION WITH OTHER TOOLS

      Use with procmail
 
      The following recipe (a) spam-bins anything that bogofilter rates as spam, (b) registers the words in messages
      rated as spam as such, and (c) registers the words in messages rated as non-spam as such. With this in place, it
      will normally only be necessary for the user to intervene (with -Ns or -Sn) when bogofilter miscategorizes
      something.
 
      # filter mail through bogofilter, tagging it as Ham, Spam, or Unsure,
      # and updating the wordlist
 
      :0fw
      | bogofilter -u -e -p
 
      # if bogofilter failed, return the mail to the queue;
      # the MTA will retry to deliver it later
      # 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h
 
      :0e
      { EXITCODE=75 HOST }
 
      # file the mail to spam-bogofilter if it's spam.
 
      :0:
      * ^X-Bogosity: Spam, tests=bogofilter
      spam-bogofilter
 
      # file the mail to unsure-bogofilter
      # if it's neither ham nor spam.
 
      :0:
      * ^X-Bogosity: Unsure, tests=bogofilter
      unsure-bogofilter
 
      # With this recipe, you can train bogofilter starting with an empty
      # wordlist.  Be sure to check your unsure-folder regularly, take the
      # messages out of it, classify them as ham (or spam), and use them to
      # train bogofilter.
 
      The following procmail rule will take mail on stdin and save it to file spam if bogofilter thinks it's spam:
 
      :0HB:
      * ? bogofilter
      spam
 
      and this similar rule will also register the tokens in the mail according to the bogofilter classification:
 
      :0HB:
      * ? bogofilter -u
      spam
 
      If bogofilter fails (returning 3) the message will be treated as non-spam.
 
      This one is for maildrop. It automatically defers the mail and retries later when the xfilter command fails. Use
      this in your ~/.mailfilter:
 
      xfilter "bogofilter -u -e -p"
      if (/^X-Bogosity: Spam, tests=bogofilter/)
      {
        to "spam-bogofilter"
      }
 
      The following .muttrc lines will create mutt macros for dispatching mail to bogofilter.
 
      macro index d "<enter-command>unset wait_key\n\
      <pipe-entry>bogofilter -n\n\
      <enter-command>set wait_key\n\
      <delete-message>" "delete message as non-spam"
      macro index \ed "<enter-command>unset wait_key\n\
      <pipe-entry>bogofilter -s\n\
      <enter-command>set wait_key\n\
      <delete-message>" "delete message as spam"
 
      Integration with Mail Transport Agent (MTA)
 
      bogofilter can also be integrated into an MTA to filter all incoming mail. While the specific implementation is
      MTA dependent, the general steps are as follows:
 
      1. Install bogofilter on the mail server.
 
      2. Prime the bogofilter databases with a spam and non-spam corpus. Since bogofilter will be serving a larger
         community, it is important to prime it with a representative set of messages.
 
      3. Set up the MTA to invoke bogofilter on each message. While this is an MTA specific step, you'll probably need
         to use the -p, -u, and -e options.
 
      4. Set up a mechanism for users to register spam/non-spam messages, as well as to correct mis-classifications.
         The most generic solution is to set up alias email addresses to which users bounce messages.
 
      5. See the doc and contrib directories for more information.
 
      Use of R to verify bogofilter's calculations
 
      The -R option tells bogofilter to generate an R data frame. The data frame contains one row per token analyzed.
      Each such row contains the token, the sum of its database "good" and "spam" counts, the "good" count divided by
      the number of non-spam messages used to create the training database, the "spam" count divided by the spam
      message count, Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w), and an indicator character
      (+ if the token's f(w) value exceeded the minimum deviation from 0.5, - if it didn't). There is one additional
      row at the end of the table that contains a label in the token field, followed by the number of words actually
      used (the ones with + indicators), Robinson's P, Q, S, s and x values and the minimum deviation.
 
      The R data frame can be saved to a file and later read into an R session (see [5] the R project website for
      information about the mathematics package R). Provided with the bogofilter distribution is a simple R script
      (file bogo.R) that can be used to verify bogofilter's calculations. Instructions for its use are included in the
      script in the form of comments.

LOG MESSAGES

      Bogofilter writes messages to the system log when the -l option is used. What is written depends on which other
      flags are used.
 
      A classification run will generate (we are not showing the date and host part here):
 
      bogofilter[1412]: X-Bogosity: Ham, spamicity=0.000227
      bogofilter[1415]: X-Bogosity: Spam, spamicity=0.998918
 
      Using -u to classify a message and update a wordlist will produce (on a single line):
 
      bogofilter[1426]: X-Bogosity: Spam, spamicity=0.998918,
        register -s, 329 words, 1 messages
 
      Registering words (-l and -s, -n, -S, or -N) will produce:
 
      bogofilter[1440]: register-n, 255 words, 1 messages
 
      A registration run (using -s, -n, -N, or -S) will generate messages like:
 
      bogofilter[17330]: register-n, 574 words, 3 messages
      bogofilter[6244]: register-s, 1273 words, 4 messages
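 
      Log lines of the shapes shown above can be picked apart with regular expressions. The patterns below are derived
      only from these samples and assume the default message format; a customized format would need different
      patterns. The name parse_log_line is hypothetical.

```python
import re

PID_RE = re.compile(r"bogofilter\[(\d+)\]:")
CLASSIFY_RE = re.compile(r"X-Bogosity: (\w+), spamicity=(\d+\.\d+)")
REGISTER_RE = re.compile(r"register ?-([snSN]), (\d+) words, (\d+) messages")

def parse_log_line(line):
    """Extract whatever classification and registration fields a line contains."""
    entry = {}
    pid = PID_RE.search(line)
    if pid:
        entry["pid"] = int(pid.group(1))
    classified = CLASSIFY_RE.search(line)
    if classified:
        entry["class"] = classified.group(1)
        entry["spamicity"] = float(classified.group(2))
    registered = REGISTER_RE.search(line)
    if registered:
        entry["register"] = registered.group(1)
        entry["words"] = int(registered.group(2))
        entry["messages"] = int(registered.group(3))
    return entry
```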

FILES

      /etc/bogofilter.cf
             System configuration file.
 
      ~/.bogofilter.cf
             User configuration file.
 
      ~/.bogofilter/wordlist.db
             Combined list of good and spam tokens.

RELATED

      bogolexer(1), bogotune(1), bogoupgrade(1), bogoutil(1)

REFERENCES

      1. A Plan For Spam
         http://www.paulgraham.com/spam.html
 
      2. Spam Detection
         http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
 
      3. A Statistical Approach to the Spam Problem
         http://www.linuxjournal.com/article/6467
 
      4. Another improvement
         http://www.garyrobinson.net/2004/04/improved_chi.html
 
      5. the R project website
         http://cran.r-project.org/
 
      6. bogofilter project page
         http://bogofilter.sourceforge.net/
