Note: I have updated and improved this system. See "Filtering referral spam from my server logs, part 2".

A few years ago I removed all third-party analytics software from my websites and went back to using plain server logs. I use goaccess for viewing server statistics; it's fast and flexible, and it can show information on the command line or generate HTML reports.

One problem is that some sites send bogus requests with their own URL as the Referer value. These requests then show up in the logs, which makes it much harder to find useful information:

[Image: Referer spam ruins everything]

There are a couple of solutions I considered:

1. Blocking the referrals directly from nginx. This would stop them from ever reaching the logs, but would require frequent updating of the server configuration. I was also a little concerned that adding hundreds of blocked URLs could slow things down slightly.
2. Excluding the URLs via .goaccessrc. goaccess can already exclude referrals, but there is a hard limit of 64 without recompiling. It also requires the list of URLs to be regularly updated.
3. Filtering the logs before parsing. This adds an extra step, but gives me more flexibility in what is removed.

In the end I went for option #3, as I wanted to try and remove requests that matched a certain pattern instead of filtering individual domains or IP addresses.

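Most of the spam referrals I see have a few things in common: they request the homepage (/) rather than an individual post, and they get back the same 301 redirect with a 194-byte body.

This makes it a little easier to filter them out without using their domain.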

GNU/Linux has a number of text processing tools, but grep was the simplest option as I only need to drop lines that contain a single string.

The following snippet of bash code removes most spammy requests from the access log and then pipes it into goaccess:

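#!/bin/bash
# Drop homepage requests answered with a 194-byte 301, then pass the rest to goaccess.
# Inside single quotes the double quotes need no backslash escaping.
grep -v '"GET / HTTP/1.1" 301 194' access_log | goaccess -a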

The finished result looks like this:

[Image: Referer spam is gone]

It's a bit of a scattergun approach and there's a risk that I'll exclude legitimate traffic from the log. However, I'm more interested in referrals to individual posts (which I use to see what people like/don't like) so I'm not too worried.
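
One way to gauge that risk is to look at what the filter throws away before trusting it. The following is only a minimal sketch: dropped_referers is a hypothetical helper name, and it assumes the log uses the combined format, where the Referer is the fourth field once a line is split on double quotes.

```shell
#!/bin/bash
# Hypothetical helper: show the most frequent Referer values among the
# lines that the filter would remove from the given log file.
# Assumes the combined log format:
#   IP - - [date] "request" status bytes "referer" "user agent"
dropped_referers() {
    # -F treats the pattern as a fixed string, so the dot in HTTP/1.1 is literal
    grep -F '"GET / HTTP/1.1" 301 194' "$1" \
        | awk -F'"' '{ print $4 }' \
        | sort | uniq -c | sort -rn | head -n 20
}
```

Running `dropped_referers access_log` prints a count next to each Referer. If a site I recognise turns up near the top of that list, the pattern is probably too broad.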

There are still a few spam sites getting through which use HEAD to request the homepage. I made a few tweaks to exclude them as well, so the final version looks like this:

#!/bin/bash
# Drop redirected homepage GETs and all homepage HEAD requests,
# then pass what is left to goaccess.
grep -v -e '"GET / HTTP/1.1" 301 194' \
    -e '"HEAD / HTTP/1.1"' access_log \
    | goaccess -a

It's not perfect, but so far it's cut out 99% of the referral spam I was seeing.
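
If the list of patterns keeps growing, it may be easier to keep them in a separate file than to add more -e options. This is a sketch under assumptions rather than what I run: filter_log and spam-patterns.txt are hypothetical names, and the pattern file is expected to hold one fixed string per line.

```shell
#!/bin/bash
# Hypothetical helper: filter_log PATTERN_FILE LOG_FILE
# Prints every log line that contains none of the strings in PATTERN_FILE.
#   -v  invert the match (keep non-matching lines)
#   -F  treat each pattern as a fixed string, not a regular expression
#   -f  read the patterns from a file, one per line
# Caution: a blank line in the pattern file matches every log line,
# so the output would be empty.
filter_log() {
    grep -v -F -f "$1" "$2"
}
```

With the two strings from the script above saved in spam-patterns.txt, `filter_log spam-patterns.txt access_log | goaccess -a` should behave much like the final version, and new patterns can be added without touching the script.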