cleanfeed(8)
NAME
Cleanfeed - spam filter for Usenet news servers
SYNOPSIS
INN: Installed as filter_innd.pl, location is configured
into INN at compile time.
Highwind servers: typhoond -program cleanfeed -body
NNTPRelay: ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl
DESCRIPTION
A spam filter for Usenet servers. Cleanfeed blocks spam
on the way into your server, before it is written to disk
or propagated to outbound feeds. It can also block
binaries in non-binary newsgroups and includes several
other features to keep your newsfeed clean.
Cleanfeed currently works with INN, Cyclone, Typhoon,
Breeze, and NNTPRelay servers. As of version 0.95.3, the
program is "unified", meaning the same program will work
with all of the above servers. (Previously, there were
different versions for INN and for the Highwind servers.
NNTPRelay support was new in 0.95.3.)
USAGE
For all versions, place the cleanfeed.conf configuration
file somewhere, then edit the Cleanfeed source file and
change the $config_dir option at the top to point to the
directory where the config file lives.
INN
Install the filter file as filter_innd.pl, in the
location you specified in config.data. Once in place,
the filter is loaded with the command ctlinnd reload
filter.perl meow. Filtering can be turned on with
ctlinnd perl y and turned off with ctlinnd perl n.
Cyclone/Typhoon/Breeze
Add the -program <file> and -body options to the
bin/start script, where <file> is the location and
name of the Cleanfeed program. Restart the server.
Cleanfeed will run as an external process.
NNTPRelay
Find the ExternalFilter directive in config.txt and
make it look like:
ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl
Cleanfeed will run as an external process.
CONFIGURATION OPTIONS
Configuration is accomplished by setting the various
options in the cleanfeed.conf configuration file. This
file is evaluated as Perl code, so comments can be
included in the usual Perl # syntax.
If you would rather not use the cleanfeed.conf file, you
can set its location to "undef" in the source and edit the
configuration variables directly in the source.
cleanfeed.conf has two sections, %config_local and
%config_append. Entries in %config_local will override
the default settings of the same name in the Cleanfeed
source. Entries in %config_append can be used to add to
the default regular expressions for badguys, bin_allowed,
poison_groups, md5_exclude, allexclude, baddomainpat, and
exempt. Settings in %config_append for these items will
be appended to the default regexps, separated by a "|"
(logical or).
All of this is done quite blindly, with no sanity checks
at all, so if you do anything odd, be careful.
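To make the "appended with a |" behavior concrete, here is a minimal sketch in Python (not Cleanfeed's own Perl). The function name and both patterns are hypothetical and are not Cleanfeed's real defaults; the point is only how joining with "|" behaves, including what happens when the addition is accidentally empty.

```python
import re

def append_pattern(default: str, addition: str) -> str:
    """Join a default regexp and a local addition with "|" (logical or)."""
    return default + "|" + addition

# Hypothetical patterns for illustration only.
default_exempt = r"^proxy\.example\.com$"
combined = append_pattern(default_exempt, r"^news\.example\.net$")

print(bool(re.search(combined, "news.example.net")))
print(bool(re.search(combined, "host.example.org")))

# An empty addition leaves a trailing "|": an empty alternative that
# matches any string at all. Nothing checks for this kind of accident.
broken = append_pattern(default_exempt, "")
print(bool(re.search(broken, "host.example.org")))
```

The last case is why an empty or stray "|" in an appended setting is worth checking for by hand.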
Options that are on/off or yes/no should be set to 1 for
on/yes, or 0 for off/no.
First, you need to tell Cleanfeed which news server
software you are using. At the top of the file, set the
appropriate variable to 1. For INN, set $inn; for
Cyclone, Typhoon, or Breeze, set $highwind; and for
NNTPRelay, set $nntprelay. Ensure the other two (the ones
you're not using) are set to 0.
aggressive
Set this to 0 to disable all content-based and highly-
aggressive filters. Helpful to please paranoid
lawyers, or paranoid customers.
do_md5
When turned on, the MD5 EMP checks will be done. This
should be left on unless you have a really good reason
to turn it off. If you're running Hippo along with
Cleanfeed, you might feel Cleanfeed's MD5 checks are
redundant and want to turn them off, for example. It
would probably be better to leave it on with the
history turned way down, instead.
md5maxmultiposts
Start rejecting articles after we have seen this many
copies, according to the MD5 checksum filter. This
can safely be set lower than maxmultiposts.
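As a rough illustration of what "start rejecting after this many copies" means for a checksum-based filter, here is a minimal Python sketch. The class and method names are hypothetical, and the real filter also limits and expires its history, which is left out here.

```python
import hashlib
from collections import Counter

class Md5EmpFilter:
    """Count identical bodies by MD5 digest; reject once the limit is passed."""

    def __init__(self, max_multiposts: int):
        self.max_multiposts = max_multiposts
        self.seen = Counter()  # digest -> number of copies seen so far

    def check(self, body: str) -> bool:
        """Return True to accept the article, False to reject it."""
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        self.seen[digest] += 1
        return self.seen[digest] <= self.max_multiposts

emp = Md5EmpFilter(max_multiposts=2)
for _ in range(4):
    print(emp.check("MAKE MONEY FAST"))
```

With a limit of 2, the first two copies pass and later copies are rejected; an unrelated body is unaffected.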
MD5History
How many articles to remember for MD5-based EMP
comparison. Since the MD5 filter is not prone to
false positives, setting this higher is a good idea to
catch more spam, if you have the RAM to spare.
MD5maxlife
When a spam is identified by the MD5 EMP filter, it is
saved for continual rejection. MD5maxlife specifies
how long, in hours, to keep a saved MD5 id which is no
longer getting any hits. (A spam id which is still
getting matches will be saved regardless of age.) 24
hours works well; higher is better, but uses more
resources.
fuzzy_md5
When turned on, the message bodies will be munged up a
bit before MD5 checksums are generated. Whitespace
and other non-alphanumeric characters are stripped and
letters are forced to lowercase. MIME-separator lines
are removed, and a couple of other bits of treachery are
applied to try to defeat the "hashbuster" spam-bots.
This adds a bit of "fuzziness" to the MD5 filter, and
results in a performance hit as well.
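The munging described for fuzzy_md5 can be sketched as follows. This is a simplified Python illustration, not the filter's actual code: it only lowercases and strips non-alphanumeric characters, and it does not attempt the MIME-separator removal or the other tricks mentioned above. The function name is hypothetical.

```python
import hashlib
import re

def fuzzy_digest(body: str) -> str:
    """MD5 of the body after lowercasing and dropping non-alphanumerics."""
    normalized = re.sub(r"[^a-z0-9]", "", body.lower())
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Two bodies that differ only in case, spacing, and punctuation
# collapse to the same digest.
a = fuzzy_digest("Make MONEY fast!!!  xq7")
b = fuzzy_digest("make money fast xq7")
print(a == b)
```

The extra regular-expression pass over every body is where the performance cost comes from.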
fuzzy_max_length
Sets the maximum number of lines for an article body
to be subject to the fuzzy_md5 munging (above). This
keeps extremely large articles out of those nasty
regular expressions.
MD5HistSize
The maximum allowed size of the EMP memory for the
MD5-checksum EMP filter. Use this as a "sanity check"
to prevent a sudden burst of spam from eating up all
of your memory. It should be set high enough so that
you normally never hit this number; use MD5maxlife
to expire the hash instead.
do_phl
Turns on the NNTP-Posting-Host/Lines EMP filter. This
filter identifies spam by identical posting-host
headers and article sizes in a short period of time.
You really don't want to turn this off.
do_fsl
Turns on the From/Subject/Lines EMP filter. This
filter identifies spam by identical From and Subject
headers and article sizes in a short period of time.
This is the one that gets the fewest hits
these days, but you still don't want to shut it off.
maxmultiposts
Start rejecting articles after we have seen this many
copies, according to the header-based EMP filter.
ArticleHistory
How many ids to remember for header-based EMP
comparison. Setting this higher will catch more spam
because there will be a larger "window" to look at.
Larger settings will also consume more memory and have
a (small) impact on performance. Most articles will
actually take up two entries in this history because
there are two different header-based filters.
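To see why one article can occupy two history entries, here is a small Python sketch that derives one key per header-based filter. The key layout ("phl:" and "fsl:" prefixes, colon-joined fields) and the function name are assumptions made for illustration; only the choice of headers comes from the descriptions of do_phl and do_fsl above.

```python
def emp_keys(headers: dict) -> list:
    """Build one history key per header-based EMP filter."""
    lines = headers.get("Lines", "")
    keys = []
    host = headers.get("NNTP-Posting-Host")
    if host:
        # NNTP-Posting-Host/Lines filter
        keys.append("phl:" + host + ":" + lines)
    # From/Subject/Lines filter
    keys.append("fsl:" + headers.get("From", "") + ":"
                + headers.get("Subject", "") + ":" + lines)
    return keys

article = {
    "NNTP-Posting-Host": "dialup1.example.net",
    "From": "a@example.com",
    "Subject": "Hi",
    "Lines": "12",
}
for key in emp_keys(article):
    print(key)
```

An article without an NNTP-Posting-Host header would produce only the one From/Subject/Lines key in this sketch.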
EMPmaxlife
Same as MD5maxlife but for the header-based EMP
filter.
EMPHistSize
Same as MD5HistSize but for the header-based EMP
filter. If you are running the header-based filter
but not the MD5 filter for whatever reason, set this
high.
maxgroups
Reject articles crossposted such that followups will
be to more than this many newsgroups. (Behavior
slightly changed from 0.95.3.)
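The maxgroups wording ("such that followups will be to more than this many newsgroups") suggests counting the groups a followup would go to. A minimal Python sketch of that idea, assuming the usual Usenet convention that a Followup-To header, when present, replaces Newsgroups as the followup target; the function names are hypothetical and this is not taken from the filter source.

```python
def followup_group_count(headers: dict) -> int:
    """Count the groups a followup would be posted to."""
    target = headers.get("Followup-To") or headers.get("Newsgroups", "")
    return len([g for g in target.split(",") if g.strip()])

def too_many_groups(headers: dict, maxgroups: int) -> bool:
    return followup_group_count(headers) > maxgroups

wide = {"Newsgroups": "misc.test,alt.test,news.test"}
narrowed = {"Newsgroups": "misc.test,alt.test,news.test",
            "Followup-To": "misc.test"}
print(too_many_groups(wide, 2))
print(too_many_groups(narrowed, 2))
```

Under this reading, a widely crossposted article that directs followups to a single group would not trip the limit.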
tfjmaxgroups
Specify a special, lower crosspost limit for test,
forsale, and jobs groups, which are especially plagued
by crossposting.
block_binaries
Enables blocking of binary posts in non-binary
newsgroups. Which newsgroups allow binaries is
configured with bin_allowed (below).
max_encoded_lines
Sets the number of uuencoded or base64-encoded lines
to allow before considering a post to be a binary.
This should be set high enough to pass regular PGP
signatures. (Those satanic Netscape crypto-sigs can
die along with the other binaries.) Default is 15
lines, which may be a little low if you are lenient.
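One way to picture the max_encoded_lines threshold is a line counter like the Python sketch below. The two regular expressions are simple heuristics chosen for this example (full-length uuencode lines starting with "M", and long runs of base64 characters); they are assumptions, not the patterns the filter uses.

```python
import re

# Heuristic: a full uuencoded line is "M" followed by 60 encoded characters.
UU_LINE = re.compile(r"^M[\x20-\x60]{60}$")
# Heuristic: a base64 line is 60 to 76 characters from the base64 alphabet.
B64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")

def count_encoded_lines(body: str) -> int:
    return sum(1 for line in body.splitlines()
               if UU_LINE.match(line) or B64_LINE.match(line))

def looks_binary(body: str, max_encoded_lines: int = 15) -> bool:
    """True if the body has more encoded lines than the allowed number."""
    return count_encoded_lines(body) > max_encoded_lines

signature = "\n".join(["Q" * 64] * 5)          # short, signature-sized block
attachment = "\n".join(["M" + "A" * 60] * 20)  # 20 uuencode-shaped lines
print(looks_binary("hello\n" + signature))
print(looks_binary(attachment))
```

A short signature block stays under the limit, while a longer encoded run is treated as a binary.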
block_mime_html
Enables blocking of MIME-encapsulated HTML posts.
This does NOT affect straight text/html or
multipart/alternative posts of the type created by
misconfigured Netscape and Microsoft newsreaders, but
ONLY posts which are MIME-encapsulated HTML, a
favorite format of sex spammers which often sneaks in
under the EMP radar.
block_html
Enables blocking of HTML and multipart/alternative
posts.
block_late_cancels
If turned on, cancels for recently rejected articles
will be rejected. Set the window with MIDmaxlife
(below). This will result in a huge number of
rejections if you have multiple full feeds and you
aren't backlogging. If you are concerned about your
downstream sites receiving the cancels, leave this
off. If you need a major performance boost, turn it
on.
MIDmaxlife
How long to remember rejected message-ids so cancels
for these posts can later be rejected. Specified in
hours. This only has an effect if block_late_cancels
is enabled (above).
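The interaction between block_late_cancels and MIDmaxlife can be sketched like this in Python. The class and method names are hypothetical, times are plain seconds passed in by the caller, and the parsing of the Control header is deliberately minimal.

```python
class RejectedIds:
    """Remember rejected message-ids so later cancels for them can be refused."""

    def __init__(self, max_life_hours: float):
        self.max_life = max_life_hours * 3600  # window in seconds
        self.rejected = {}                     # message-id -> time of rejection

    def remember(self, message_id: str, now: float) -> None:
        self.rejected[message_id] = now

    def is_late_cancel(self, headers: dict, now: float) -> bool:
        """True if this is a cancel for an article rejected within the window."""
        parts = headers.get("Control", "").split()
        if len(parts) == 2 and parts[0].lower() == "cancel":
            when = self.rejected.get(parts[1])
            return when is not None and now - when <= self.max_life
        return False

mids = RejectedIds(max_life_hours=1)
mids.remember("<spam1@example.com>", now=0)
print(mids.is_late_cancel({"Control": "cancel <spam1@example.com>"}, now=600))
print(mids.is_late_cancel({"Control": "cancel <spam1@example.com>"}, now=7200))
```

A cancel arriving inside the window is refused; once the window has passed, the cancel is handled like any other article.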
trimcycles
The EMP memories are trimmed every trimcycles times
through the filter.
EMPstarttrimming
Tells the filter not to waste time trimming the EMP
memories until they have this many entries. Just a
minor performance enhancement during the first hours
the filter is running or when you first start innd.
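How trimcycles, EMPstarttrimming, and the maxlife settings could fit together is sketched below in Python. This is an illustration under assumptions: the class and attribute names are invented, time is supplied by the caller in seconds, and the hard size cap (the HistSize settings) is left out.

```python
class EmpHistory:
    """History that is trimmed by age every trim_cycles calls."""

    def __init__(self, max_life_hours, trim_cycles, start_trimming):
        self.max_life = max_life_hours * 3600
        self.trim_cycles = trim_cycles
        self.start_trimming = start_trimming
        self.last_hit = {}  # key -> time of the most recent hit
        self.calls = 0

    def record(self, key, now):
        self.last_hit[key] = now
        self.calls += 1
        # Only bother trimming every trim_cycles calls, and only once
        # the history has grown to start_trimming entries.
        if (self.calls % self.trim_cycles == 0
                and len(self.last_hit) >= self.start_trimming):
            self.trim(now)

    def trim(self, now):
        cutoff = now - self.max_life
        self.last_hit = {k: t for k, t in self.last_hit.items() if t >= cutoff}

history = EmpHistory(max_life_hours=1, trim_cycles=2, start_trimming=2)
for key, when in [("a", 0), ("b", 10), ("c", 4000), ("d", 4001)]:
    history.record(key, when)
print(sorted(history.last_hit))
```

Because an entry's timestamp is refreshed on every hit, an id that keeps matching is never older than the cutoff and so survives trimming.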
verbose
When turned on, verbose logging to news.notice will
happen; spam domains will be listed, etc. When off,
only general messages will be logged, making the
news.daily summaries less interesting but much shorter
and more to the point. (There is, alas, no way to
shut off news.notice logging entirely.) (news.notice
only applies to INN.)
logfile (Highwind and NNTPRelay)
If set to the path to a file, this will enable logging
of message-ids of all articles processed by the
filter. Rejections will be logged with the reason for
rejection. Note that this will create a very large
logfile which you will need to rotate or delete.
reportfile (Highwind and NNTPRelay)
If set to the path to a file, this will enable
generation of a simple report of articles accepted and
rejected. The report file will contain one entry per
line with the start time, end time, number of articles
accepted, and number of articles rejected, tab-
separated.
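Since the report is one tab-separated entry per line, it is easy to post-process. Here is a minimal Python sketch of reading one line; the function name is hypothetical, the sample values are made up, and because the format of the two time fields is not stated above they are kept as plain strings.

```python
def parse_report_line(line: str) -> dict:
    """Split one report entry: start time, end time, accepted, rejected."""
    start, end, accepted, rejected = line.rstrip("\n").split("\t")
    return {
        "start": start,        # time format not specified; kept as text
        "end": end,
        "accepted": int(accepted),
        "rejected": int(rejected),
    }

# Made-up sample line for illustration.
entry = parse_report_line("887000000\t887003600\t1200\t300\n")
total = entry["accepted"] + entry["rejected"]
print(entry["rejected"] / total)
```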
statfile
If this is set to the full path of a file, a crude
stats file will be written each time the filter is
reloaded with ctlinnd reload filter.perl meow (for
INN) or whenever the Cleanfeed process receives a
SIGUSR1 (for Highwind and NNTPRelay). At this point
the file just shows how many entries are present in
each of the EMP histories and the MID history. This
is useful to ensure that your retention-times
EMPmaxlife and MD5maxlife are not too high, and that
your max-sizes are not too low. You want the EMP
memory to expire by time, not max size, for best
performance. The default for this is undef, which
disables the stat file.
More comprehensive stats are planned for the future.
bin_allowed
This is a regular expression telling the anti-binary
filter in which newsgroups binaries are allowed. If
all groups in the Newsgroups header match this
pattern, binaries are allowed through the filter.
(This obviously has no effect when the binary filter
is disabled.) The default regexp can be added to in
cleanfeed.conf if you have additional groups where you
want to allow binary posting.
poison_groups
If any groups in the Newsgroups header match this
regexp, the article will be rejected. Thus you can
reject crossposts to certain groups even if they are
also posted to groups you carry. This regexp can be
added to in cleanfeed.conf.
md5exclude
If an article is posted only to groups matching this
regexp, the MD5 EMP filter will not be applied.
Useful for "test" groups where it's okay for lots of
the posts to be the same. This regexp can be added to
in cleanfeed.conf.
allexclude
If an article is posted only to groups matching this
regexp, NO checks are applied at all. This regexp can
be added to in cleanfeed.conf.
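The four group-based settings above differ in one detail: bin_allowed, md5exclude, and allexclude apply only when every group in the Newsgroups header matches, while poison_groups applies when any group matches. A short Python sketch of the two tests, with hypothetical function names and made-up patterns:

```python
import re

def split_groups(newsgroups: str) -> list:
    return [g.strip() for g in newsgroups.split(",") if g.strip()]

def all_groups_match(pattern: str, newsgroups: str) -> bool:
    """The "posted only to matching groups" test."""
    groups = split_groups(newsgroups)
    return bool(groups) and all(re.search(pattern, g) for g in groups)

def any_group_matches(pattern: str, newsgroups: str) -> bool:
    """The poison-style "any group matches" test."""
    return any(re.search(pattern, g) for g in split_groups(newsgroups))

# Made-up patterns, not the shipped defaults.
binaries_ok = r"\.binaries\."
poison = r"^alt\.example\.poison$"

print(all_groups_match(binaries_ok, "alt.binaries.pictures,comp.lang.perl"))
print(any_group_matches(poison, "comp.lang.perl,alt.example.poison"))
```

So a binary crossposted to one binaries group and one discussion group is still subject to the binary filter, while a single poison group is enough to reject an article.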
badguys
This is a monster regular expression containing
domains of known spammers. Only the "middle" part of
each domain is listed; these are checked as email
addresses in From headers by appending a list of
top-level domains to the end, and as URLs by prepending
http:// and an optional "www.". If you modify this
list, be very careful not to end up with "||" in there
(two "or" signs in a row); this will match every
single post that comes through, which is Bad.
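The way a list of "middle" parts can be expanded into an email check and a URL check is sketched below in Python. The domain names, the top-level-domain list, and the exact shape of both expressions are assumptions for illustration; the real expressions in the filter source will differ.

```python
import re

def build_badguys(middles, tlds=("com", "net", "org")):
    """Expand "middle" domain parts into an email matcher and a URL matcher."""
    middle_alt = "|".join(middles)
    tld_alt = "|".join(tlds)
    # Email form: the middle part followed by one of the listed TLDs.
    email = re.compile(
        r"@(?:[\w-]+\.)*(?:%s)\.(?:%s)\b" % (middle_alt, tld_alt), re.I)
    # URL form: http:// and an optional "www." in front of the middle part.
    url = re.compile(r"http://(?:www\.)?(?:%s)\." % middle_alt, re.I)
    return email, url

# Hypothetical spammer domains.
email_re, url_re = build_badguys(["badcorp", "junkmail"])
print(bool(email_re.search("From: bob@badcorp.com")))
print(bool(url_re.search("visit http://www.junkmail.net/offer")))
print(bool(url_re.search("see http://www.example.org/")))
```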
baddomainpat
If a post contains a URL for a site whose domain name
matches this pattern (in .com and .net TLDs only) the
post will be rejected. For example, there are
hundreds of spamming porn sites whose domain names
begin or end with "xxx". This prevents us from having
to keep up with their nonsense. Yes, it's a little
aggressive, but it works.
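A rough Python sketch of the baddomainpat idea follows: pull host names out of URLs in the body, look only at .com and .net hosts, and test the domain name against a pattern. The URL regexp, the function name, and the sample pattern (names beginning or ending with "xxx") are assumptions for this example.

```python
import re

URL_HOST = re.compile(r"http://([A-Za-z0-9.-]+)", re.I)

def has_bad_domain(body: str, pattern: str) -> bool:
    """True if any URL host in .com or .net has a domain name matching pattern."""
    for host in URL_HOST.findall(body):
        labels = host.lower().rstrip(".").split(".")
        if (len(labels) >= 2 and labels[-1] in ("com", "net")
                and re.search(pattern, labels[-2])):
            return True
    return False

pattern = r"^xxx|xxx$"  # hypothetical pattern
print(has_bad_domain("see http://www.xxxstuff.com/ now", pattern))
print(has_bad_domain("see http://www.xxxstuff.org/ now", pattern))
print(has_bad_domain("see http://www.example.com/ now", pattern))
```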
exempt
Regular expression of NNTP-Posting-Hosts that are
exempt from the posting-host-based EMP filter. This
is for high-output systems where all posts contain the
same NNTP-Posting-Host header, such as AOL, which if
not exempted would end up hitting the posting-host EMP
filter with all of their posts. There aren't many of
these out there; a "regular" multi-user system does
not present a problem because the filter doesn't kick
in until it sees a large number of posts from the same
posting-host and also of the same length, in a short
period of time.
After modifying the filter file, always check for mistakes
by typing:
perl -cw filter_innd.pl (or cleanfeed or whatever you called it)
There should be no errors and no warnings.
You can check cleanfeed.conf with:
perl -cw cleanfeed.conf
You will get several warnings about variables being used
only once; these can be ignored.
If you are running INN, you can modify the file and reload
it with ctlinnd reload filter.perl meow while the server
is running. The code (outside the subroutines) will be
executed at that time, each time.
With the Highwind servers, modifying the program will
require a server restart (use the bin/restart script).
Note that this will result in all connections (including
newsreader clients) being dropped. (The next release
should be able to reload its configuration upon receiving
a SIGHUP.)
I have no idea what NNTPRelay does, but I'm guessing it
needs a restart as well.
INSTALLATION - INN
These instructions assume you have the Perl hooks compiled
into INN. If you don't, you will need to add them and
rebuild the INN distribution before proceeding.
With INN, Perl is embedded into the innd program. The
filter file defines subroutines that are called by innd at
the appropriate times.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with INN, you will need:
o INN 1.5.1 or later (1.7.2+insync1.1d recommended)
o Perl 5.004 or later (will work with 5.003 without MD5
support)
o Perl hooks compiled into INN
o The MD5 Perl module
INN is available from:
http://www.isc.org/inn.html
The Insync distribution (highly recommended) is available
from:
http://www.insync.net/~aos/inn.html
The MD5 Perl module is available from:
http://www.perl.com/CPAN-local/modules/by-module/MD5/
Perl itself is available from the Perl home page:
http://www.perl.com/
PATCHES AND STUFF
Cleanfeed requires some patches to INN in order to
function properly. Unfortunately, I had to change
filter.patch for Cleanfeed version 0.95.3; the change is
trivial, but I needed access to another header
(X-Newsreader). Nothing will break if you do not install
this patch, but you will lose several very effective
filters.
If you are running INN 1.7.2+insync1.1d, you already have
the original filter.patch and the dynamic-load.patch. You
need only apply the upgrade.patch.
filter.patch
This patch provides the basic functionality for
Cleanfeed by making some extra headers available to
the Perl filter, as well as message bodies. This
patch was changed in version 0.95.3. It is against
INN 1.7.2 and should be applied in the innd directory.
This patch is included in the insync "megapatch" for
INN as of version 1.1c, so if you are running this
version of INN you need not apply this patch.
dynamic-load.patch
This patch enables INN's Perl interpreter to load
dynamic modules. It is necessary for MD5 support.
The patch is against INN 1.7+insync and should be
applied in the lib directory (NOT the innd directory).
It applies cleanly to other versions of INN including
1.5.1 and 1.7.2. This patch is included in the insync
"megapatch" for INN as of version 1.1d, so if you are
running this version of INN you need not apply this
patch.
If you are still using INN 1.5.1, you can use
dynamic-1.5.1.patch instead.
In order to compile INN with the new patch, you need
to edit the PERL_LIB entry in config.data. Type this
command at the shell, and paste its output into
config.data as PERL_LIB:
perl -MExtUtils::Embed -e ldopts
Apparently, you can also simply enter that line in
backquotes as PERL_LIB.
Finally, you need to install the MD5 Perl module.
This patch requires Perl 5.004! INN will not compile
linked with Perl 5.003 after following these
instructions!
AIX: There is a problem with Perl dynamic loading from
INN under the AIX operating system. In simple terms,
it doesn't work. This seems to be a problem with the
gcc compiler. Success has been reported by rebuilding
both Perl and INN with IBM's commercial compiler CSet
(a.k.a. xlC).
upgrade.patch
For current users of Cleanfeed, this is a patch for an
already-patched INN, or for 1.7.2+insync1.1d, to bring
you up to the new version of the Cleanfeed patch. Not
applying this patch right now will only lose you a
couple of filters, and nothing will break if you don't
apply it (no changes to the filter source or
configuration will be required).
After applying the patches, rebuild all of INN and do a
"make update". The first patch (filter.patch) only
requires innd to be rebuilt, but the dynamic-load.patch
requires you to rebuild the whole distribution. Current
users upgrading with upgrade.patch need only rebuild innd
and reinstall that executable.
Thus:
cd inn [to the top-level source directory]
make clean
cd innd
cp wherever/filter.patch . [from the Cleanfeed distribution]
patch <filter.patch
cd ../lib
cp wherever/dynamic-load.patch . [from the Cleanfeed distribution]
patch <dynamic-load.patch
cd ../config
emacs config.data [edit the PERL_LIB entry as above]
make all
make update
INSTALLING CLEANFEED - INN
The location where INN looks for the Perl filter is set in
config.data, as _PATH_PERL_FILTER_INND. By default, the
filename is filter_innd.pl. The Cleanfeed filter program
file should be installed in this location. INN comes with
an example filter_innd.pl file; move this file (or
whatever other filter is in place) out of the way first.
Before putting the filter in place, edit the file,
changing $config_dir to the location of your
cleanfeed.conf file.
After editing the file, always check for errors with the
command:
perl -cw filter_innd.pl
Once the file is in place, tell innd to reload it:
ctlinnd reload filter.perl meow
And, if Perl filtering is currently disabled, enable it:
ctlinnd perl y
Now, you can watch it working by looking at your
news.notice log:
tail -f /var/log/news/news.notice
If your server is running a full feed, you should start
seeing a constant stream of rejections almost immediately.
INSTALLATION - HIGHWIND SERVERS
The various Highwind server packages (Cyclone, Typhoon,
and Breeze) all have the same external filter interface.
The filter runs as its own process, reading from standard
input and writing to standard output.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with a Highwind server, you will
need:
o Cyclone 1.3.5 or later, or Typhoon or Breeze
o Perl 5.003 or later
o The MD5 Perl module
The Highwind servers are commercial products. For more
information:
http://www.highwind.com/
The MD5 Perl module is available from:
http://www.perl.com/CPAN-local/modules/by-module/MD5/
Perl itself is available from the Perl home page:
http://www.perl.com/
INSTALLING CLEANFEED - HIGHWIND
The Cleanfeed program file should be installed as
"cleanfeed" in your news server's bin directory
(cyclone/bin, etc). Make it owned by news.news and make
it executable.
Before putting the filter in place, edit the file,
changing $config_dir to the location of your
cleanfeed.conf file. Also ensure that the shebang line
(the first line of the file, starting with #!) points to
the correct location of your perl executable.
After editing the file, always check for errors with the
command:
perl -cw cleanfeed
There should be no warnings.
Now, edit your bin/start script. You need to add two
options to the command line that starts up the server
process, the -program option to tell it what program to
use as a filter, and the -body option to tell it to send
the bodies as well as the headers.
./typhoond -program /typhoon/bin/cleanfeed -body
...along with whatever else you have cluttering up the
command line.
(Highwind has indicated that this may/will be a config
file option in a future release.)
Now you can restart the server with the bin/restart
script. Check to see if the server is running, with
something like "ps -e | grep clean" or with top. If you
have a logfile defined in Cleanfeed, it should appear.
(Note about the ps command -- on my system the cleanfeed
process appears in the ps output as "cleanfee", so "ps -e
| grep cleanfeed" obviously won't work as expected.)
INSTALLATION - NNTPRELAY
Please note that I do not have an NNTPRelay server, nor
access to one, nor much interest in mucking around with
Windows NT, and thus I have not tested the NNTPRelay
filtering support. The necessary changes and notes were
contributed by someone else. Additions and improvements
to this documentation would be most welcome.
The filter interface in NNTPRelay is pretty much the same
as in the Highwind servers.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with NNTPRelay, you will need:
o NNTPRelay version 1.1b4 or later
o Perl 5.003 or later
o The MD5 Perl module
NNTPRelay is available from:
http://nntprelay.maxwell.syr.edu/
An NT binary release of Perl 5.004, which apparently
includes the MD5 module, can be found at:
http://www.perl.com/CPAN/ports/win32/Standard/x86
The MD5 module (in source code) can be found at:
http://www.perl.com/CPAN-local/modules/by-module/MD5/
INSTALLING CLEANFEED - NNTPRELAY
Before putting the filter in place, edit the file,
changing $config_dir to the location of your
cleanfeed.conf file.
Install the Cleanfeed program file wherever is appropriate
on your system, as "cleanfeed.pl". Edit NNTPRelay's
config.txt file, adding an entry like this:
ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl
Of course, use the correct path to your Perl executable
and to the Cleanfeed program file. Now restart NNTPRelay.
If you defined a logfile in Cleanfeed, it should appear.
SIGNALS
When running under Cyclone, Typhoon, Breeze, or NNTPRelay,
Cleanfeed will catch SIGUSR1 and write its crude
current-status file (see statfile in the config section) on the
next cycle through the filter. (I honestly don't know if
SIGUSR1 is something which exists on NT for NNTPRelay.)
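The "catch the signal now, write the file on the next cycle" pattern can be sketched in Python as below. The class, its methods, and the output layout (name, tab, entry count) are hypothetical; the handler only sets a flag, and the file is written from the main loop.

```python
import signal

class StatWriter:
    """Defer writing a status file until the next pass through the filter."""

    def __init__(self, path):
        self.path = path
        self.pending = False

    def on_signal(self, signum, frame):
        # Keep the handler trivial: just note that a report was requested.
        self.pending = True

    def end_of_cycle(self, histories):
        """Call once per article; writes the file if a signal arrived."""
        if not self.pending:
            return False
        with open(self.path, "w") as fh:
            for name, table in histories.items():
                fh.write("%s\t%d\n" % (name, len(table)))
        self.pending = False
        return True

def install(writer):
    # SIGUSR1 does not exist on every platform, so check before registering.
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, writer.on_signal)
```

With something like this in place, running kill -USR1 against the filter process would produce the file after the next article is handled, not immediately.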
CREDITS
Written by Jeremy Nixon <jeremy@exit109.com>.
Based on Jeff Garzik's EMP filter. Original Cyclone port
by David Riley.
I can't possibly mention everyone who has submitted ideas
or fixes for the filter, but I'd like to acknowledge the
substantial contributions of several people: Danhiel
Baker, Frank Copeland, Brian Moore, John Payne, Russ
Allbery, and SeokChan LEE. Thanks, guys.
The dynamic-load.patch for INN is from Piers Cawley. The
body-filtering portion of the INN filter.patch is from
Jeff Garzik.
COPYRIGHT
Copyright 1997 by Jeremy Nixon, All Rights Reserved.
LICENSE
This software may be distributed freely, provided it is
intact (including all the files from the original
archive). You may modify it, and you may distribute your
modified version, provided the original work is credited
to the appropriate authors, and your work is credited to
you (don't make changes and pass them off as my work), and
that you aren't charging for it.
AVAILABILITY
This filter is available at:
http://www.exit109.com/~jeremy/news/antispam.html