There are two tasks that never seem to end: removing dog hair from my clothes, and filtering spam.

Last year I implemented a basic system for filtering my logs, but over time it has become less effective. I wanted to replace it with something more thorough that I could reuse across projects.

Before implementing any changes, my goaccess reports looked like this:

(Screenshot: referer spam REALLY ruins everything)

Nearly everything on that list is spam, and that's only the top 10. For monthly and yearly reports, that much spam makes it hard to spot the pages that are actually doing well.

In the end I modified my existing grep setup with additional passes. It's a little slow on large logs, so I usually save the filtered log to speed up subsequent reports (there's a sketch of that below).

I created three files with keywords to remove: spammers.list, bots.list, and global-spammers.list.

All of these files live in my ~/.config/ directory so I can use them on different projects. They're simple lists with one entry per line:

spammersdomain.com
spammersdomain.org
anotherspammer.net
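
Because grep reads these lists directly, blocking a new offender is just a matter of appending a line. A quick sketch (the domain here is made up):

echo 'newspammer.example' >> ~/.config/spammers.list

One caveat: grep -f treats each line as a regular expression, so the dots in a domain will match any character. That's close enough for my purposes, but swapping in grep -F would match the entries as fixed strings instead.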

Filtering the access log looks like this:

# drop bare requests for / (mostly bots), then filter against each keyword list
grep -v -e '"GET / HTTP/1.1" 301 194' -e '"HEAD / HTTP/1.1"' access_log \
    | grep -v -f ~/.config/spammers.list \
    | grep -v -f ~/.config/bots.list \
    | grep -v -f ~/.config/global-spammers.list \
    | goaccess -a

That produces a report like this (before filtering, over 50% of the hits were from bots or spammers):

(Screenshot: a yearly report with 90% less spam)
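
The grep passes are the slow part, so to speed up subsequent reports I save the filtered log and hand it straight to goaccess next time. A minimal sketch using tee (filtered.log is just a placeholder name):

grep -v -e '"GET / HTTP/1.1" 301 194' -e '"HEAD / HTTP/1.1"' access_log \
    | grep -v -f ~/.config/spammers.list \
    | grep -v -f ~/.config/bots.list \
    | grep -v -f ~/.config/global-spammers.list \
    | tee filtered.log \
    | goaccess -a

The next time I want a report, goaccess can read filtered.log directly (or I can pipe it in) and skip the filtering entirely.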

No system will be 100% perfect, but so far this has cut out a large portion of the noise from my logs.