Filtering referral spam from my server logs

Note: I have updated and improved this system. See "Filtering referral spam from my server logs, part 2".

A few years ago I removed all third-party analytics software from my websites and went back to using plain server logs. I use goaccess for viewing server statistics; it's fast and flexible, and it can show information on the command line or generate HTML reports.

One problem is that some sites will send bogus requests with their url as the Referer value. They then show up in logs which makes it much harder to find useful information:

Referer spam ruins everything

There are a couple of solutions I considered:

Blocking the referrals directly from nginx
This would stop them for ever reaching the logs, but would require frequent updating of the server configuration. I was also a little concerned that adding hundreds of blocked urls could slow things down slightly.
Excluding the urls via .goaccessrc
goaccess can already exclude referrals, but there is a hard limit of 64 without recompiling. It also requires the list of urls to be regularly updated.
Filtering the logs before parsing
This adds an extra step, but gives me more flexibility in what is removed.

In the end I went for option #3, as I wanted to try and remove requests that matched a certain pattern instead of filtering individual domains or IP addresses.

Most of the spam referrals I see have a few things in common:

  • They only hit the homepage.
  • They're using an old url, so the request gets redirected.
  • They use HTTP/1.1.

This makes it a little easier to filter them out without using their domain.

GNU/Linux has a number of text processing tools, but grep was the simplest option as I'm ignoring a single string from each line.

The following snippet of bash code removes most spammy requests from the access log and then pipes it into goaccess:

#!/bin/bash
grep -v '\"GET / HTTP/1.1\" 301 194' access_log | goaccess -a

The finished result looks like this:

Referer spam is gone

It's a bit of a scattergun approach and there's a risk that I'll exclude legitimate traffic from the log. However, I'm more interested in referrals to individual posts (which I use to see what people like/don't like) so I'm not too worried.

There are still a few spam sites getting through which use HEAD to request the homepage. I made a few tweaks to exclude them as well, so the final version looks like this:

#!/bin/bash
grep -v -e '\"GET / HTTP/1.1\" 301 194' \
	-e '\"HEAD / HTTP/1.1\"' access_log \
    | goaccess -a

It's not perfect, but so far it's cut out 99% of the referral spam I was seeing.


Writing a Leanpub book with Emacs

Leanpub has a robust web interface, but it seemed appropriate to create "Writing PHP with Emacs" using Emacs itself. This is how I do it, from managing ideas to publishing the content itself.

Managing ideas

Before I started writing the book I had a single text file that I collected ideas in. This worked okay in the beginning, but once things started moving along it became unwieldy to manage.

I switched to keeping my notes in zetteldeft which made life much easier. There is a single note titled "Writing PHP with Emacs" which I use as an entry point. All related notes are linked to this one, and I also tag them with #php-with-emacs so I can quickly search for them.

Not all of these notes are directly related to the book, but they might contain information that I want to include in the future.

Planning

The entire book plan is stored in a single TODO.org file in the project directory. I use it to keep a list of the overall contents, as well as planning out individual chapters.

The plan looks a little like this (with content removed for brevity):

#+TITLE: PHP with Emacs - TODO List
#+SEQ_TODO: TODO STARTED TO-EDIT | COMPLETE POSTPONED CANCELLED 

* TODO Milestones [4/5]
** COMPLETE Version 0.2
   CLOSED: [2020-06-09 Tue 19:55] DEADLINE: <2020-05-31 Sun>
*** COMPLETE php-mode: Customization - Using TAGS files [3/3]
    CLOSED: [2020-05-30 Sat 10:26] SCHEDULED: <2020-05-28 Thu>
    :PROPERTIES:
    :EFFORT:   1:00
    :END:
    Prefer to use php-lsp, but this is an option too.
     - [X] What TAGS files can be used for
     - [X] Generating
     - [X] Navigating using a TAGS file
       Shortcuts and everything go here.
* TODO Backlog [1/22]
** TODO Configuring a project for WordPress
** TODO Configuring a project for Drupal

I use org-columns for quickly viewing the list of milestones. From this view I can set estimates and view time worked on each milestone and sub-task.

My book plan using column mode

The Backlog heading keeps a list of all future content that I want to work on. When it's time to move an item to the plan, I'll refile it to a milestone using a custom function called pn/org-refile-in-file:

(defvar pn/org-last-refile-marker nil "Marker for last refile")

(defun pn/org-refile-in-file (&optional prefix)
  "Refile to a target within the current file."
  (interactive)
  (let ((helm-org-headings-actions
	 '(("Refile to this heading" . helm-org--refile-heading-to))))
    (save-excursion
      (helm-org-in-buffer-headings)
      (org-end-of-subtree t)
      (setq pn/org-last-refile-marker (point-marker)))))

I use this instead of org-refile as I have a lot of org files that I don't want to be used as targets. Once a headline has been moved to a milestone I'll add a time estimate and schedule it. I use org-projectile so any scheduled tasks in the book TODO show up in my agenda.

Organizing files

The project directory contains the manuscript directory, which is used by Leanpub to generate the book. It also contains a lightweight Emacs user directory which I use when generating screenshots.

I use binder to keep a list of manuscript files. binder also supports adding file-specific notes, although I prefer to keep things in my TODO file.

Using binder to view the book

For most navigation I'll use projectile, but binder is helpful when trying to get an overall look at the book or when I'm jumping between chapters.

Writing

Leanpub has its own offshoot of Markdown called Markua, so everything is written and edited using markdown-mode+. I would have preferred to write in org-mode - and there is an ox-leanpub exporter - but I didn't want to add an extra step to the write -> publish process. Maybe next time.

One of the things that caught me out at the start was line breaks. Markdown will normally ignore a single line break, so hard-wrapped lines will still be treated as a single line. Markua treats them as hard line breaks, so it altered the look of the book and made pages too narrow. I switched to visual-line-mode which wraps lines visually but doesn't alter the text.

I use a .dir-locals.el file to automatically enable specific modes and to increase the font size.

((markdown-mode . ((eval . (progn (turn-off-auto-fill)
				  (text-scale-set 1)
				  (turn-on-olivetti-mode)))
		   (fill-column . 80)
		   (visual-fill-column-width . 80)
		   (mode . flyspell)
		   (mode . binder)
		   (mode . whitespace-cleanup)
		   (mode . visual-line)
		   (mode . visual-fill-column))))

The modes I use are:

  • flyspell – Enables real time spell-checking in the buffer.
  • binder – Enables binder navigation for moving forwards and backwards between chapters.
  • whitespace-cleanup – Strips trailing whitespace when saving a buffer.
  • visual-line – Enables soft word-wrapping.
  • visual-fill-column – Adds word-wrapping at a specific column, instead of the window's edge.

The code in the eval block turns off hard-wrapping, increases the font size, and enables olivetti-mode. olivetti centers the text in the current window which I prefer when writing words.

For taking screenshots, I have a separate Emacs configuration that contains just PHP and web-specific packages. It loads only the packages I want, and sizes the window/fonts to a width that fits on the page. Startup looks like this:

emacs --no-init-file --load "emacs-env/init.el"

The --no-init-file= option tells Emacs to ignore my usual init.el file, and then I manually load the screenshot initializer.

Publishing changes

The "standard" Leanpub plan allows the use of git for managing book files. I use git every day, so the workflow is something I'm very accustomed to. The master branch contains content that will be published, and WIP content is stored in a named branch (usually something like add-XYZ-chapter).

Once content is ready I'll push it to the remote repository and then manually publish it from within Leanpub. There is an API, but that's reserved for the "PRO" plan.

I have two Beeminder goals set up for the book:

  • write-php-book – This goal is updated any time I make a commit to the book repository. This makes sure I'm writing often enough to get things done.
  • publish-php-book – Leanpub can notify a webhook when a new version is published, so I use this to add a data point to Beeminder. This makes sure I'm publishing changes I write instead of waiting until things are "perfect".

I originally started with a word count goal. This worked well when I was trying to write all of the initial content, but once I started editing it made more sense to change to a goal that didn't focus on raw words.


Reducing toil

I recently read Google's advice on Eliminating Toil. Although the guidelines are aimed at Google's web services, I thought it was useful to apply in other places too.

But first things first, what is toil?

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

I think using the word "toil" for these tasks is a little bit of hyperbole, but the idea is sound. There are tasks that I perform on a regular basis that I can automate or eliminate with some thought.

Things I have done

Some of these are really small.

Created elisp functions for writing notes
There are a couple of projects where I keep update logs in a Markdown file. I wrote some small elisp functions to add timestamped headings, so that I can add entries with a shortkey instead of opening the file and typing it in.
Digitized my diet and exercise tracking
I used to keep these all on paper, but I brought things into Emacs which is much quicker.
Automated blog regeneration
The old version of this site needed building through a script, and then the built files had to be uploaded. I've changed this so that adding a post to the blog automatically rebuilds it, and the site is rebuilt every hour so that I can schedule posts. This is something I'll write about in the future.
Wrote some small scripts for fetching error logs
Super small, logging into a machine, cd'ing into the log directory, then running tail can all be done with a single script. Helpful when I get downtime emails.
Tracked how I work on a typical project and looked for rough edges
Some of these were incredibly minor. Things like:
  • Finding quicker ways to skip between text in an Emacs file.
  • Learning how to jump to a function's definition instead of searching for it.
  • Wrote auto-packaging for MaxCop. Manually creating a zip file and adding three files isn't a huge burden, but automating things makes creating releases much easier. It's also something I can reuse for other projects.

More to do

There are still plenty of places where I can make improvements, and I think I'll have to spend some time really looking at my workflows.

A few things I can think of right now:

  • Make use of snippets for inserting boilerplate code.
  • Automate adding tasks to FacileThings when they come in via email.
  • Create a repeatable checklist when processing each inbox item. Having a simple flowchart would reduce some of the mental overhead.
  • Automate downloading and parsing of data sources that I normally do by hand:
    • RunKeeper archives
    • Bank csv files
    • Calorie information for foods

Making these improvements takes some time, but sharpening the saw is an important habit to develop. I'm pretty confident that the effort I've put into to optimizing other areas has already paid off many times over.


Tracking my workouts with Emacs

The system I use for tracking my workouts is similar to the one I use for tracking my diet. Both of them use Emacs and an org-mode file for storing data, but my workout tracking is simpler as I'm not making entries during the day.

A typical workout entry looks like this:

* CAL-OUT Exercise for <2020-10-26 Mon>
  :PROPERTIES:
  :beeminder: bodyweight-workouts
  :beeminder-skip-deadlines: true
  :END:

** Warm up

 - [X] 10 x Shoulder Band
 - [X] 20 x Wrist rotations
 - [X] 20 x Side leg raises
 - [X] 20 x Glute bridges
 - [X] 20 x Donkey kicks
 - [X] 90s x Dead Bugs
 - [X] 90s x Hollow Hold

** Strength

Aim for 5-8 of each one, 3 sets total (4 optional). Rest 90 seconds between  pairs.

|------------------------+-------+-------+-------+-------|
| Exercise               | Set 1 | Set 2 | Set 3 | Set 4 |
|------------------------+-------+-------+-------+-------|
| Bodyweight Squat       |     5 |     5 |     5 |     5 |
| Pull Up                |     8 |     8 |     8 |     8 |
|------------------------+-------+-------+-------+-------|
| Ring Hold (seconds)    |    20 |    20 |    25 |       |
| Romanian Deadlift      |     8 |     8 |     8 |       |
|------------------------+-------+-------+-------+-------|
| Incline Row            |     8 |     8 |     8 |       |
| Pushup                 |     8 |     8 |     8 |       |
|------------------------+-------+-------+-------+-------|
| Lying leg raise        |     6 |     6 |     8 |       |
|------------------------+-------+-------+-------+-------|

** Notes

- Some soreness in the left elbow

It's a really simple mix of an org-mode checklist and a table for individual exercises. I like how flexible tables are in org-mode, and entering data is quick and simple.

The PROPERTIES: block is entirely optional. It uses beeminder.el to submit a data point whenever I close the org-mode headline.

Like my diet file, adding new entries is handled by org-capture. I don't currently extract statistics from my workout file, but it's on my TODO list. Even without any stats it's extremely useful to look back and see where I've improved, and the notes section reminds me of what worked and what didn't.

The routine itself is based on the /r/bodyweightfitness recommended routine.


How to measure my productivity?

I often read - and write - about productivity, but what am I actually thinking of when the word "productivity" comes up? How should I measure how productive I'm being?

My current method is really simple: I count my hours billed. As long as I'm doing enough work, I class my day as productive.

There are a whole bunch of problems with this approach:

  1. It prioritizes work over everything else.
  2. Some days are filled with fixing bugs across a number of different projects. This adds up to more billed time, so I earn more than if I'd concentrated on a single project.
  3. It doesn't take organizational work into account.
  4. It feels like I'm trying to extract as much money out of myself as possible. If I have the choice between working through the evening or relaxing, I'll often pick work because it's the "productive" thing to do.

So it's not ideal. What else could I use?

Count the number of TODO items checked completed
This one sounds good, but it has issues. It's easy to prioritize small, less-important tasks to bump up the numbers.
Velocity
This comes from extreme programming and involves counting how many story points have been completed during a sprint. This would probably involve me summing the estimated time of tasks completed during a week, and then graphing them as the year goes on.
Some kind of points system
I did this before using The Printable CEO. It helped at the time, but it still has the habit of skewing things depending on how points are assigned. I also spent a little too much time on tweaking how everything was set up.
Some kind of calculated score
This would be something that takes priority, size, and earliness (or lateness) of completion into account. I have a feeling I'd end up spending too much time tweaking the calculation.

Out of all of these I prefer the velocity calculation. But it's still missing a key part: what about my goals?

What I really want to be checking is "am I doing enough to move me closer to each goal". I don't have to work on every goal every day, but I also don't want long periods where I'm ignoring them.

I think I'll end up going with a mix of velocity, regular check-ins, and some magic ingredient that I haven't quite figured out yet. None of them have the external reward that comes from completing billed work, so maybe I need to start with some kind of reward system.