Statistical Spam Filtering
Geoffrey D. Bennett

- SpamAssassin
- bogofilter
- dspam

SpamAssassin http://www.spamassassin.org/

- Perl, regexes, scoring

- /\b(?:accept\b|are accepting).{1,15}credit cards?\b/i
  +2.507

- /\b(?:bad|poor|no\b|eliminate|repair|(?:re)?establish|damag).{0,10} (?:credit|debt)\b/i
  +1.230

- eval:html_test('font_blue')
  +0.1

- procmailrc:

  :0fw
  | spamc -f
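
The scoring idea above can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's own code: the two patterns and scores are copied from the rules listed above, the eval: test is left out, and the 5.0 cut-off and the names RULES, score and is_spam are assumptions made for the example.

```python
import re

# (pattern, score) pairs copied from the two regex rules listed above.
RULES = [
    (re.compile(r"\b(?:accept\b|are accepting).{1,15}credit cards?\b", re.I), 2.507),
    (re.compile(r"\b(?:bad|poor|no\b|eliminate|repair|(?:re)?establish|damag)"
                r".{0,10} (?:credit|debt)\b", re.I), 1.230),
]

# Assumed cut-off for this sketch; a real installation makes this configurable.
THRESHOLD = 5.0


def score(text):
    """Add up the score of every rule whose pattern matches the text."""
    return sum(points for pattern, points in RULES if pattern.search(text))


def is_spam(text):
    """A message counts as spam once its total score reaches the threshold."""
    return score(text) >= THRESHOLD
```

No single rule decides the outcome: a message that trips both rules above still only scores 3.737, so it takes several independent hits to cross the threshold.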

bogofilter http://bogofilter.sourceforge.net/

- "Bayesian", "Fisher-Robinson's Inverse Chi-Square"

- requires training:

  bogofilter -s < spam-corpus
  bogofilter -n < innocent-corpus

- procmailrc:

  :0fw
  | bogofilter -e -p

  :0:
  * ^X-Bogosity: Yes, tests=bogofilter
  spam-bogofilter
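
A rough sketch of the inverse chi-square combining step named above, following Robinson's published description rather than bogofilter's source. It assumes training has already produced a spam probability strictly between 0 and 1 for each token; the function names chi2q and spamicity are made up for the example.

```python
import math


def chi2q(x2, dof):
    """Upper-tail chi-square probability; dof must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def spamicity(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Each probability must lie strictly between 0 and 1.
    """
    n = len(token_probs)
    ln_spam = sum(math.log(p) for p in token_probs)
    ln_ham = sum(math.log(1.0 - p) for p in token_probs)
    # Stays near 1 when the tokens are consistently spammy.
    h = chi2q(-2.0 * ln_spam, 2 * n)
    # Stays near 1 when the tokens are consistently innocent.
    s = chi2q(-2.0 * ln_ham, 2 * n)
    return (1.0 + h - s) / 2.0
```

Scores near 1 suggest spam and scores near 0 suggest innocent mail. Mixed or uninformative evidence lands near 0.5, which is what lets a filter report "unsure" instead of guessing.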

dspam http://www.nuclearelephant.com/projects/dspam/

- "DSPAM has a strong focus on providing better data to already existing
  algorithms (Bayesian, Chi-Square, etcetera). Combination algorithms work
  inherently well, but depend on the quality of data."

- Chained Tokens

  - Combining two "words" together

    E.g. "click" and "here" separately may not mean anything, but the
    probability of seeing the two together is greater in spam.

    Or "bgcolor" and "#000000".

- Inoculation:

  bob_c:     "|/path/to/dspam --addspam --inoculate --user bob --corpus"

- "Bayesian Noise Reduction"

- Modes:

  - Local Delivery Agent

    - Mail interface for submitting spam

    - Web interface for quarantined spam

  - Filter-style

  - POP3 Proxy

- Building:

  ./configure --with-dspam-home=$HOME/.dspam \
    --with-userdir-owner=none --with-userdir-group=none \
    --with-dspam-mode=none --with-dspam-owner=none \
    --prefix=$HOME \
    --enable-spam-delivery --enable-delivery-to-stdout

- Training:

  tools/dspam_corpus g innocent-corpus
  tools/dspam_corpus --addspam g spam-corpus

- Using through procmail:

  :0fw
  | dspam --stdout --deliver=spam,innocent --mode=teft --user g --feature=chained,noise

  Correcting mistakes (feed the misclassified message back in):

  | dspam --addspam
  or
  | dspam --falsepositive
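
To illustrate the chained tokens described earlier, here is a minimal Python sketch. It is not dspam's tokenizer: the function name chained_tokens and the "+" joiner are assumptions made for the example.

```python
def chained_tokens(words):
    """Return the single words followed by each adjacent pair as one token."""
    tokens = list(words)
    # Pair every word with its right-hand neighbour, e.g. "click+here".
    tokens += [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return tokens
```

For ["click", "here"] this yields the two single words plus "click+here", so the pair gets its own statistics alongside the individual words.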