A few years ago I removed all third-party analytics software from my websites and went back to using plain server logs. I use goaccess for viewing server statistics; it's fast and flexible, and it can show information on the command line or generate HTML reports.
One problem is that some sites send bogus requests with their own URL as the Referer value. These then show up in the logs, which makes it much harder to find genuine referrals.
There were a few solutions I considered:
- Blocking the referrals directly in nginx
  - This would stop them from ever reaching the logs, but would require frequent updates to the server configuration. I was also a little concerned that adding hundreds of blocked URLs could slow things down slightly.
- Excluding the URLs via goaccess
  - goaccess can already exclude referrals, but there is a hard limit of 64 without recompiling. It also requires the list of URLs to be regularly updated.
- Filtering the logs before parsing
  - This adds an extra step, but gives me more flexibility in what is removed.
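For what it's worth, if you did go with option #2 it would look something like this in the goaccess configuration file, using its ignore-referer option (the domains below are made up, and the 64-entry limit mentioned above still applies):

```
# goaccess.conf -- one entry per referrer to hide (hypothetical domains)
ignore-referer spammy-site.example
ignore-referer *.another-spammer.example
```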
In the end I went for option #3, as I wanted to try and remove requests that matched a certain pattern instead of filtering individual domains or IP addresses.
Most of the spam referrals I see have a few things in common:
- They only hit the homepage.
- They're using an old URL, so the request gets redirected.
- They use HTTP/1.1.
This makes it a little easier to filter them out without using their domain.
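Together, those three traits show up in the access log as a redirected GET / over HTTP/1.1. Here's a rough sketch of what that looks like, using invented log lines and IP addresses (the 301 status and 194-byte body are what my server happens to return for a redirect; yours may differ):

```shell
#!/bin/sh
# Two made-up lines in combined log format: the first mimics a spam
# referral (homepage, redirected, HTTP/1.1), the second a normal request.
cat > sample_log <<'EOF'
203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 301 194 "http://spam.example/" "Mozilla/5.0"
198.51.100.9 - - [01/Jan/2024:00:00:01 +0000] "GET /posts/hello HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# Dropping every redirected HTTP/1.1 hit on the homepage keeps only the
# genuine request for the post.
grep -v '"GET / HTTP/1.1" 301' sample_log
```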
GNU/Linux has a number of text processing tools, but grep was the simplest option as I'm ignoring a single string from each line.
The following snippet of bash code removes most spammy requests from the access log and then pipes the result into goaccess:

```bash
#!/bin/bash
grep -v '"GET / HTTP/1.1" 301 194' access_log | goaccess -a
```
The finished result looks like this:
It's a bit of a scattergun approach and there's a risk that I'll exclude legitimate traffic from the log. However, I'm more interested in referrals to individual posts (which I use to see what people like/don't like) so I'm not too worried.
There are still a few spam sites getting through which use HEAD to request the homepage. I made a few tweaks to exclude them as well, so the final version looks like this:
```bash
#!/bin/bash
grep -v -e '"GET / HTTP/1.1" 301 194' \
        -e '"HEAD / HTTP/1.1"' access_log \
    | goaccess -a
```
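As a quick sanity check, the same filter can be run over a small sample to see how many lines it drops. The log lines below are invented for illustration; real numbers will obviously depend on your own access_log:

```shell
#!/bin/sh
# A tiny made-up log: one GET spam hit, one HEAD spam hit, one real request.
cat > access_log.sample <<'EOF'
203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 301 194 "http://spam.example/" "Mozilla/5.0"
203.0.113.8 - - [01/Jan/2024:00:00:01 +0000] "HEAD / HTTP/1.1" 200 0 "http://spam.example/" "-"
198.51.100.9 - - [01/Jan/2024:00:00:02 +0000] "GET /posts/hello HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# grep -c '' counts every line; grep -vc counts the lines that survive
# the two exclusion patterns.
total=$(grep -c '' access_log.sample)
kept=$(grep -vc -e '"GET / HTTP/1.1" 301 194' \
               -e '"HEAD / HTTP/1.1"' access_log.sample)
echo "kept $kept of $total lines"
# prints: kept 1 of 3 lines
```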
It's not perfect, but so far it's cut out 99% of the referral spam I was seeing.