2021-06-11
These are my experiments. Sorry, I do not have many statistics on how good these are.
normalizemime.cc - 2021-06-11 version (1.21). This is a MIME email message parser to be used as a preprocessor for email classification software.
Tries to normalize the content to 8-bit encoding with the UTF-8 character set. Also appends a copy of the message body with HTML removed (IMG and A tags remain unaffected). Changes by version:
2003-08-06: also decodes HTML entities, like &auml; or &#228;, and limits the size of attached binary files.
2003-08-19: also decodes URL encoding in HREF and SRC parameters in HTML, and fixes a MIME decoding bug.
2003-09-18: fixes a core dump bug, and filters away X-Spam-Status and X-CRM114-Status headers.
2003-09-22: filters my Pine status headers away and inserts information in the text about malformed base64 padding.
2003-10-05: fixes header decoding and marks unnecessary header encoding with a token.
2004-04-26: deletes null characters and limits message size to 1MB.
2005-12-13: recognizes some more misspelled charsets.
2007-03-20: allows easier tweaking of the header purge list, see around line 1500.
2021-06-11: case-insensitive encoding names allowed, thanks Mirko Buffoni!
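As an illustration of the entity and URL decoding steps listed above, here is a minimal Python sketch. It is not the code from normalizemime.cc; the function names and the regular expression are assumptions made for this example, and it only handles double-quoted attribute values.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical pattern: a double-quoted HREF or SRC attribute value.
ATTR_RE = re.compile(r'(\b(?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)

def decode_entities(text: str) -> str:
    """Replace named and numeric HTML entities with the characters they stand for."""
    return html.unescape(text)

def decode_url_attributes(markup: str) -> str:
    """Percent-decode the values of HREF and SRC attributes, leaving other text alone."""
    return ATTR_RE.sub(
        lambda m: m.group(1) + unquote(m.group(2)) + m.group(3),
        markup,
    )
```

Decoding these forms before classification means that an obfuscated link and its plain spelling produce the same tokens.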
2004-09-17 version changes:
2004-06-28 version changes:
2005-06-28 version 1.16 fixes a core dump on null chars. Thanks to Richard Carver for pointing that out.
These text strings on the output come from base64 decoding, and indicate possible attacks against decoders and virus/spam scanners:
X-warn: jHnnb3URVED5UgX9fxnZfAsV invalid base64 padding
X-warn: ksU7AwpcqQoiCC84ceueEqKn padding inside base64
If the base64 string was inside a header, the headers get mangled totally, so this is not strictly speaking a header but just a word token that crm114 could learn.
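The exact conditions that normalizemime.cc checks are not described here, so the following Python sketch is only one plausible reading of the two warnings. It assumes that "invalid base64 padding" means a wrong amount of trailing "=" and that "padding inside base64" means a "=" before the end of the data. The function name is hypothetical.

```python
def base64_padding_warnings(data: str) -> list[str]:
    """Return the warning texts that apply to a base64 string (assumed rules)."""
    compact = "".join(data.split())          # ignore line breaks and spaces
    body = compact.rstrip("=")               # data without trailing padding
    trailing = len(compact) - len(body)      # number of trailing '=' characters
    warnings = []
    # Valid base64 has at most two '=' at the end and a length divisible by 4.
    if trailing > 2 or len(compact) % 4 != 0:
        warnings.append("invalid base64 padding")
    # A '=' anywhere before the trailing padding is suspicious.
    if "=" in body:
        warnings.append("padding inside base64")
    return warnings
```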
This token is a spam indicator: the header was encoded even though it consisted only of US-ASCII:
ONLYASCIIKFrjuZnFvmJJdrRkeXrd95wu
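A minimal Python sketch of how such needless encoding could be detected, using the standard email.header module. This is an illustration, not the check in normalizemime.cc, and the function name is an assumption.

```python
from email.header import decode_header

def needlessly_encoded(header_value: str) -> bool:
    """True if any encoded-word in the header decodes to plain US-ASCII bytes."""
    for payload, charset in decode_header(header_value):
        if charset is None:
            continue  # this part was not an encoded-word
        raw = payload if isinstance(payload, bytes) else payload.encode("utf-8", "replace")
        if all(byte < 0x80 for byte in raw):
            return True
    return False
```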
String ICONVERROR5iorjkfewfmkdfs2lklkfsd is added when the first
charset conversion error is detected.
String UTFATTACK45809jkHJSD82rk8903jdfj3 is added if suspicious
UTF-8 encodings are used.
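What counts as "suspicious" is not spelled out above. One common case is an overlong encoding, where a character is written with more bytes than necessary to slip past scanners. The Python sketch below checks only for that case; treating overlong forms as the trigger is an assumption, and the function name is hypothetical.

```python
def has_overlong_utf8(data: bytes) -> bool:
    """True if data contains an overlong (non-shortest-form) UTF-8 sequence start."""
    for i, byte in enumerate(data):
        following = data[i + 1] if i + 1 < len(data) else None
        if byte in (0xC0, 0xC1):
            return True  # two-byte form of a 7-bit character
        if byte == 0xE0 and following is not None and 0x80 <= following <= 0x9F:
            return True  # three-byte form of a code point below U+0800
        if byte == 0xF0 and following is not None and 0x80 <= following <= 0x8F:
            return True  # four-byte form of a code point below U+10000
    return False
```

None of these byte patterns can occur in well-formed UTF-8, so correctly encoded text is never flagged by this check.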
String BADHEADERCHARSETckW2eAWEEyAGmHQK is added if the encoding in header
was not recognized.
This header is added if the message body charset is not recognized:
X-warn: 3j94twCXM5njkztE bad charset in body
These are the headers that are removed from the messages. The list is around line 1500 in the source.
"X-Spam-", // Added by SpamAssasin for example "X-CRM114-", // Added by CRM114 "X-Virus-", // Added by ClamAV "X-UID:", // added by Pine mail user agent "Status:", // added by Pine mail user agent "X-Status:", // added by Pine mail user agent "X-Keywords:", // added by Pine mail user agent
After this filtering, the email message no longer conforms to any standard, and formatting information is irreversibly lost. Even the MIME message structure is potentially corrupted, as the encodings are decoded and message separators may appear inside the data.
This filter is useful for preprocessing messages for content-based spam filters such as crm114.
mailfilter.crm - normalizes the email and then filters it. Modified from some old crm114 source distribution.
mailfilterconfig.crm - used by both mailfilter.crm and learnspam.crm below.
procmail.txt - .procmailrc example to be used with above scripts.
learnspam.pl and learnspam.crm - split a mailbox into individual emails, then use normalizemime to remove MIME encodings and HTML. Each email is then learned unless the current css files already classify it correctly. This is TOE (train on error) behaviour. I removed all blacklist processing from mailfilter.crm when converting it to learnspam.crm, as it is bad to use blacklists when learning.
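The mailbox splitting step can be sketched in Python as below. This is not a port of learnspam.pl. It assumes the mailbox is in mbox format, where each message begins with a line starting with "From ", and it does not handle unescaped "From " lines inside message bodies. The function name is hypothetical.

```python
def split_mbox(text: str) -> list[str]:
    """Split mbox-style text into messages at lines that start with 'From '."""
    messages = []
    current = []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and current:
            messages.append("".join(current))  # close the previous message
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```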
These scripts need a recent CRM114 version. The Perl script reads the text files one message at a time and feeds each message to crm114. Spam and nonspam messages are trained alternately. Multiple messages are not fed to crm114 at once, as that seems not to work well with the current TRE regexp bugs with UTF-8 text. That's a shame, as it slows the process down by a factor of 10. Note also that you need to remove the msync() call from crm_markovian.c to get any performance out of crm114 learning.
Have the spamtext.txt and nonspamtext.txt mailbox files ready before running the script. Usage example:
./learnspam.pl
./learnspam.pl rerun
./learnspam.pl rerun
./learnspam.pl rerun
A couple of reruns with the same material, as in the example above, seem to work for me. I created a shell script to do this and generate statistics: learnspamtest.sh. Here are my latest results with repeated TOE, that is, TUNE (train until no errors):
run  nonspam learned  spam learned  nonspam ignored  spam ignored
1    55               39            2012             1130
2    18               29            2049             1140
3    9                12            2058             1157
4    3                5             2064             1164
5    0                0             2067             1169
Finally the fifth run usually is clean.
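The train-on-error passes and their repetition until a clean run can be sketched in Python like this. The classify and learn callables stand in for calls to crm114 and are hypothetical, as are the function names and the max_passes limit. The alternation of spam and nonspam follows the description of the Perl script.

```python
from itertools import zip_longest

def alternate(spam, nonspam):
    """Interleave spam and nonspam messages as (text, label) pairs."""
    pairs = []
    for s, n in zip_longest(spam, nonspam):
        if s is not None:
            pairs.append((s, "spam"))
        if n is not None:
            pairs.append((n, "nonspam"))
    return pairs

def train_until_no_errors(messages, classify, learn, max_passes=10):
    """Repeat TOE passes; return how many messages were learned in each pass."""
    history = []
    for _ in range(max_passes):
        learned = 0
        for text, label in messages:
            if classify(text) != label:   # train only on misclassified messages
                learn(text, label)
                learned += 1
        history.append(learned)
        if learned == 0:                  # a clean pass ends the loop
            break
    return history
```

With a toy classifier that simply memorizes what it was taught, a run looks like this:

```python
memory = {}
history = train_until_no_errors(
    alternate(["buy now"], ["hi mom", "see you"]),
    classify=lambda text: memory.get(text, "nonspam"),
    learn=memory.__setitem__,
)
```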