Can there be any such thing without the fear of poisoning and/or interception by spammers? I fear not, but I would like to at least take a look at the possibility.
A simple method of doing this would be to provide a feed of spam words from every blog. Provide a page inside your blogging tool that lets a user add “Spam Word Sharing” sites and then update manually when needed. The recipient blog grabs the feed, checks the time it was last updated, and, if new words are found, adds them to its own list of words. The inherent problem of this distributed method is that spammers will be able to look at the list and then modify the information they use in their spam. The upside of this method is that spammers cannot POSSIBLY look at the spam words of each and every blog unless they write some sort of intelligent spammer tool (which is NOT beyond them by any means).
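To make the pull-and-merge step concrete, here is a minimal sketch in Python. It assumes a hypothetical plain-text feed (one spam word per line) at a made-up URL, and uses the standard Last-Modified/If-Modified-Since headers to avoid re-fetching a feed that hasn’t changed; none of this is from any existing blogging tool.

```python
import urllib.error
import urllib.request

def fetch_spam_words(feed_url, last_modified=None):
    """Fetch a plain-text spam-word feed, skipping unchanged feeds."""
    request = urllib.request.Request(feed_url)
    if last_modified:
        # Ask the server to answer 304 if nothing changed since our last pull.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read().decode("utf-8")
            words = {line.strip() for line in body.splitlines() if line.strip()}
            return words, response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # feed not modified, nothing new to merge
            return set(), last_modified
        raise

# Merge whatever is new into our own word list (URL is hypothetical).
local_words = {"viagra", "casino"}
new_words, stamp = fetch_spam_words("https://example-blog.net/spam-words.txt")
local_words |= new_words
```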
Another means is to have a few centralized sources for the spam words. This would reduce the number of places people have to go to get the information. It would, however, bring up the age-old problem of announcing the presence of other such sources for synchronization. There are hundreds of different ways that these neighbors could be announced programmatically, but they are all cumbersome to code and easy to break into. This method also makes it easier for spammers to get hold of the list and poison it or route around it.
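For the centralized variant, the neighbor announcement could at least be done in-band. Here is a sketch assuming a hypothetical JSON format in which each source publishes its word list plus the peer sources it knows about; both the format and the URL are invented for illustration.

```python
import json
import urllib.request

def sync_sources(seed_urls, max_sources=10):
    """Pull words from a few central sources, following announced peers."""
    words, visited, queue = set(), set(), list(seed_urls)
    while queue and len(visited) < max_sources:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        with urllib.request.urlopen(url) as response:
            source = json.load(response)
        words.update(source.get("words", []))
        queue.extend(source.get("peers", []))  # in-band neighbor announcement
    return words

merged = sync_sources(["https://spamwords-central.example/feed.json"])
```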
I have also thought about the Bayesian concept, since I once developed a Perceptron-based Bayesian spam filter for real email that worked pretty well (it was an educational venture). Traditionally, in weblog comment spam, we tend to concentrate on a large number of words, phrases, IPs, etc. (at least I have) without trying to store any intelligence about them. A simple example is the word pair “texas” and “holdem”. Separately they are innocent, but together they are a surefire spam combination, unless your site is about poker, in which case you have a difficult spam problem anyway. So, if spam systems were developed that stored word intelligence which got modified with each spam comment, this intelligence would be smaller in size, easier to transport, and much easier to share. The drawback of this scheme is poisoning from spammers and rapid changes in content.
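To illustrate what I mean by word intelligence, here is a rough sketch that keeps spam/ham counts for single words and for adjacent word pairs, so that “texas holdem” scores high as a pair even though each word alone is innocent. The counts, the smoothing, and the crude take-the-maximum rule are all invented for the example.

```python
from itertools import pairwise  # Python 3.10+

# Invented counts: how often each token appeared in spam vs. ham comments.
spam_counts = {("texas", "holdem"): 120, ("poker",): 40}
ham_counts = {("texas",): 30, ("holdem",): 1, ("poker",): 5}

def token_spamminess(token):
    spam = spam_counts.get(token, 0)
    ham = ham_counts.get(token, 0)
    return (spam + 1) / (spam + ham + 2)  # Laplace-smoothed spam probability

def score(comment):
    words = comment.lower().split()
    tokens = [(w,) for w in words] + list(pairwise(words))
    # Crude rule: the most suspicious token decides the comment's score.
    return max((token_spamminess(t) for t in tokens), default=0.5)

print(score("free texas holdem tonight"))  # ~0.99, driven by the word pair
```

Note how small the shareable state is: a handful of token counts, rather than every blog’s full word and IP lists, which is exactly what makes it easier to transport.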
So, to summarize, we need a “spam information sharing scheme” that is selectively public, relatively small in size, and easily integrated into present infrastructures.
What do you think?
Wouldn’t such a solution itself be flooded by spammers hoping to dilute the filtering system?
How about a ‘trust/don’t trust’ mechanism? For example, every time a blacklist entry is wrong (i.e. the blacklist says a comment is spam when it is ham), you mark it as such, and blacklist entries submitted by the same user are then given less weight by your blogging system in the future. Spammers COULD still, in theory, try to disrupt the system, but users wouldn’t give any weight to their ratings – in the same way that such modifications to P2P file-sharing programs stopped the copyright cartels from disrupting the networks with poisoned files.
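Something like this minimal sketch, assuming submitters can be identified and that every caught false positive halves their weight (the names and the halving factor are made up, not from any real reputation system):

```python
from collections import defaultdict

trust = defaultdict(lambda: 1.0)  # per-submitter weight, starts fully trusted

def mark_wrong(submitter):
    """Call when an entry from this submitter flagged ham as spam."""
    trust[submitter] *= 0.5  # halve their influence on each caught mistake

def blacklist_score(submitters):
    """Total trust behind a blacklist entry, summed over its submitters."""
    return sum(trust[s] for s in submitters)

# A spammer who floods the list soon carries near-zero weight, so their
# entries stop affecting whether a comment gets treated as spam.
for _ in range(5):
    mark_wrong("spammer42")
print(blacklist_score(["spammer42"]), blacklist_score(["honest_blogger"]))
```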