There are two tasks that never seem to end: removing dog hair from my clothes, and filtering spam.
Last year I implemented a basic system for filtering my logs, but over time it has become less effective. I wanted to create something that:
- Could be reused across all of my sites.
- Could be configured through files instead of having to update the report generation script.
- Could utilize third-party spam lists.
Before implementing any changes, my goaccess reports looked like this:
Nearly everything on that list is spam, and that's only the top 10. For monthly and yearly reports this amount of spam makes it hard to find pages that are actually doing well.
In the end I modified my existing grep setup with additional passes. It's a little slow on large logs, so I usually save the filtered log to speed things up (there's a sketch of that after the pipeline below).
I created three files with keywords to remove:
bots.list contains a list of bot user agents that I want to ignore. This can also be done from within goaccess, but I wanted them completely scrubbed from my logs.
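For comparison, the goaccess-side approach would look something like this; it's only a sketch and assumes a goaccess version that has the --ignore-crawlers option:

    # Ask goaccess to drop known crawlers itself instead of scrubbing the log first.
    goaccess access_log -a --ignore-crawlers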
global-spammers.list is the contents of src/domains.txt from the referrer-spam-blocker project. It's a pretty hefty list.
spammers.list contains a list of spam domains that aren't in the global list. I normally add new items once or twice a week; there's a one-liner for that after the example entries below.
All of these files live in my "~/.config/" directory so I can use them on different projects. They're simple lists that have one entry per line:
    spammersdomain.com
    spammersdomain.org
    anotherspammer.net
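Keeping the lists current is a couple of one-liners. The domain and the checkout path here are just placeholders; substitute the real spam domain and wherever the referrer-spam-blocker checkout actually lives:

    # Append a newly spotted spam domain to the personal list (placeholder domain).
    echo 'yet-another-spammer.example' >> ~/.config/spammers.list

    # Refresh the global list from a local checkout of referrer-spam-blocker
    # (the path is an assumption -- point it at the real checkout).
    cp path/to/referrer-spam-blocker/src/domains.txt ~/.config/global-spammers.list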
Filtering the access log looks like this:
    grep -v -e '"GET / HTTP/1.1" 301 194' -e '"HEAD / HTTP/1.1"' access_log \
      | grep -v -f ~/.config/spammers.list \
      | grep -v -f ~/.config/bots.list \
      | grep -v -f ~/.config/global-spammers.list \
      | goaccess -a
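Since those extra passes are what makes large logs slow, saving the filtered output once and reporting from it later is the shortcut mentioned above. A minimal sketch, with placeholder file names:

    # Run the filter passes once and keep the result.
    grep -v -e '"GET / HTTP/1.1" 301 194' -e '"HEAD / HTTP/1.1"' access_log \
      | grep -v -f ~/.config/spammers.list \
      | grep -v -f ~/.config/bots.list \
      | grep -v -f ~/.config/global-spammers.list \
      > access_log.filtered

    # Later reports read the pre-filtered file directly and skip the slow greps.
    goaccess -a access_log.filtered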
Which results in a report like this (over 50% of the hits before were from bots or spammers):
No system will be 100% perfect, but so far this has cut out a large portion of the noise from my logs.