
Posts from 2017



filter-other-days is portable to FreeBSD

I'm pleased to announce filter-other-days 1.0.1. This is a bugfix release primarily improving portability to other Unix-like operating systems; in particular, the test suite now fully passes under FreeBSD. Specifically:

  • Various portability bugs in the test suite itself were fixed - the test suite no longer relies on GNU date (specifically, GNU date -d semantics) or a fully-functional /dev/fd (it falls back to named pipes), and it no longer hardcodes bash's install path as /bin/bash
  • Some non-portable uses of echo "\n", which break on BSD systems, were replaced with printf invocations
  • Travis CI now checks filter-other-days with Debian's checkbashisms script, which is run in strict mode
  • Non-portable uses of test's -o option were caught by checkbashisms and replaced with || (both shell fixes are illustrated in the sketch after this list)
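
For illustration, the two shell-level fixes look roughly like this (the filenames and variables are made up; the actual lines in the script and test suite differ):

    # echo's handling of backslash escapes varies between shells, so this
    # is not portable:
    echo "first line\nsecond line"
    # printf behaves the same everywhere:
    printf 'first line\nsecond line\n'

    # test's -o (and -a) operators are marked obsolescent by POSIX:
    [ -e "$logfile" -o -e "$fallback" ] && echo found
    # portable replacement:
    { [ -e "$logfile" ] || [ -e "$fallback" ]; } && echo found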

With these changes I expect that filter-other-days will probably run on all major BSD distributions. I intend to confirm this hypothesis soon and have filed bugs for OpenBSD and NetBSD, plus illumos just for kicks.

As with 1.0.0, you can clone filter-other-days from GitHub or you can download a (signed) tarball. Please do report any bugs you find in the release.

Enjoy!


filter-other-days: Artificial Ignorance-compatible logfile date filtering

I've just published version 1.0 of my latest project, filter-other-days - a shell script to filter logfiles for today's date in an Artificial Ignorance-compatible way.

If you haven't heard of Artificial Ignorance, it's something you should look into, because it's pretty awesome. Here's the tl;dr: it doesn't make sense to look for all the "interesting" things in logfiles, because it's not actually possible to enumerate all the failure conditions of a system. So instead what we do is throw away entries that we're sure are just routine. Since we've gotten rid of all the uninteresting entries, whatever is left has to be interesting.
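
A minimal sketch of the idea, assuming a pattern file and destination address like the ones below (both are placeholders; real Artificial Ignorance tooling is more sophisticated):

    # routine.patterns holds regexes for log lines you've decided are routine;
    # whatever survives the inverted match is, by definition, interesting
    cat /var/log/*.log | grep -E -v -f routine.patterns \
        | mail -s 'interesting log entries' admin@example.com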

I find this pretty compelling, and decided to start implementing it on my Tor relay. I quickly realized that my ideal workflow would be to configure cron to send me email with a daily report of interesting log entries. However, this presented a problem: how do I get just today's log entries? I wanted to handle all logfiles at once instead of receiving different reports for different logs, so I had to be able to parse every logfile in the same way. My relay runs on FreeBSD, so the logs are unstructured text files; even worse, several daemons (like Tor itself) write their timestamps in a different format. That makes parsing all the logfiles at once really difficult: I couldn't just grep for today's date, because any logfile that formatted its timestamps differently would have its legitimate entries dropped.
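
To make that concrete, here are two made-up log lines in different timestamp formats, plus the naive grep that would silently drop everything from the second file:

    # syslog-style entry in /var/log/messages
    Nov 18 03:12:45 relay sshd[941]: Connection closed by 192.0.2.10
    # ISO 8601-style entry from some other daemon's logfile
    2017-11-18 03:12:46 [notice] routine startup message

    # "today's entries", naively - this only matches the first format
    grep '^Nov 18' /var/log/*.log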

I briefly considered trying to write a regex to match all sorts of different timestamp formats, but quickly rejected this idea as too fragile. There are a lot of moving parts in a modern operating system - what if, say, a daemon changed its default timestamp format? Or, more likely, what if I simply missed a particular format present in my logs? Then I'd be accidentally throwing away an entire logfile. To solve this problem, I decided to apply the same idea behind Artificial Ignorance - if I couldn't reliably match 100% of the log entries from today's date, I could do the next best thing and attempt to discard all entries from other dates. In this case the worst that could happen is me receiving irrelevant information, and I'd be basically guaranteed never to miss a legitimate entry from today.
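
Here's a deliberately crude sketch of that inversion for a single timestamp format, just to show the shape of it; the real script handles many more formats, padding quirks, month and year boundaries, and so on:

    # build patterns matching this month's *other* days in syslog format...
    mon=$(date '+%b')
    today=$(date '+%e' | tr -d ' ')
    : > other-days.patterns
    for day in $(seq 1 31); do
        [ "$day" -eq "$today" ] && continue
        printf '^%s  ?%s \n' "$mon" "$day" >> other-days.patterns
    done
    # ...and discard anything that matches one; lines we can't parse survive,
    # which errs on the side of showing too much rather than too little
    grep -E -v -f other-days.patterns /var/log/messages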

filter-other-days is a working implementation of this design. Originally I kept it with the other random scripts in my dotfiles, but it quickly became obvious that it was useful as a standalone project. So I extracted it into its own repository, which now lives on GitHub. From there I continued to improve the script while adding a test suite and writing extensive documentation (including a Unix manpage - I always feel like a wizardly hacker when writing those things). This took, by my estimation, somewhere between 10 and 15 hours - partly because this is actually a shockingly non-trivial problem, but mostly because regexes are hard.

But today I finally finished! So I'm super excited to announce that version 1.0 of filter-other-days is now available. You can either clone it from GitHub or download a tarball (and the accompanying signature, if you want). It works pretty well already, but I have some ideas for future directions the project could go:

  1. Logic allowing you to actually specify the date you want to filter for, instead of assuming it's today (though you can already get this behavior using faketime - that's what the test suite does, and there's an example after this list)
  2. Removal of the dependency on GNU seq - this is, to my knowledge, the only non-POSIX requirement of filter-other-days
  3. Debian package, maybe?
  4. More log formats (please report bugs if you have formats filter-other-days doesn't recognize - which you probably do!)
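
In the meantime, here's roughly what the cron-plus-email workflow from earlier looks like, along with the faketime trick mentioned in item 1 (the pattern file and address are placeholders, and I'm assuming filter-other-days is invoked as a plain stdin-to-stdout filter):

    # daily report of interesting entries from today only
    cat /var/log/*.log | filter-other-days | grep -E -v -f routine.patterns \
        | mail -s 'daily log report' admin@example.com

    # filter for some other date by lying to the script about what day it is
    faketime '2017-06-01 12:00:00' filter-other-days < /var/log/messages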

If you find this project useful, let me know! I'd love to hear about how people are using it. Or if it breaks (or doesn't fit your use cases), please report bugs or send patches - I love those, too! Either way, may the logs be with you!


pump.io denial-of-service security fixes now available

Recently some denial-of-service vulnerabilities were discovered in various modules that pump.io indirectly depends on. I've bumped Express and send to pull in patched versions, and I've updated our fork of connect-auth to require a patched version of Connect, too. I've confirmed that the remaining vulnerabilities don't affect us.
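
If you want to see which versions of those modules your install actually resolves, npm can tell you; for a global install (our recommended configuration) something like this works:

$ npm ls -g express send connect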

Because of these version bumps, I've just put out security releases, which all administrators are encouraged to upgrade to as soon as possible. A semver-major release (5.0.0) went out within the past six months, so per our security support policy there are three new releases:

  1. pump.io 5.0.2 replaces 5.0.0 and is available now on npm
  2. pump.io 4.1.3 replaces 4.1.2 and is available now on npm
  3. pump.io 4.0.2 will replace 4.0.1 and is currently undergoing automated testing (it'll be on npm shortly). Update: pump.io 4.0.2 is now on npm

As these are security releases we encourage admins to upgrade as soon as possible. If you're on 5.0.0 installed via npm - our recommended configuration - you can upgrade by issuing:

$ npm install -g pump.io@5

If you're on 4.1.2, you can upgrade by issuing:

$ npm install -g pump.io@4

And when 4.0.2 is out, if you're on 4.0.1 you can upgrade by issuing:

$ npm install -g pump.io@4.0

Note though that 4.1.3 is a drop-in replacement for 4.0.2, so you should consider just upgrading to that instead. Or even better, upgrade to 5.x!

If you don't have an npm-based install, you'll have to upgrade however you normally do. How to do this will depend on your particular setup.
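
For example, if you run from a git checkout, the upgrade is usually something along these lines (the path and tag name here are placeholders for whatever your setup uses):

$ cd /opt/pump.io
$ git fetch --tags
$ git checkout v5.0.2
$ npm install

followed by restarting pump.io however you normally supervise it.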

As always, if you need help, you should get in touch with the community. I'd also like to specifically thank Jason Self, who generously deployed a 24-hour private beta of these fixes on Datamost. One of the version bumps was ever-so-slightly risky, and being able to test things in production before rolling out patches for the entire network was invaluable. I wouldn't be as confident as I am in these releases without his help. So thanks, Jason - I really appreciate it!


Zero-downtime restarts have landed

I'm thrilled to announce that zero-downtime restarts, which I've been hacking on for the past week or two, have just landed in pump.io master!

Zero-downtime restarts require at least two cluster workers and MongoDB as a Databank driver (we'll eventually relax the latter requirement as we continue to test the feature). Here's how it works:

  1. An administrator sends SIGUSR2 to the master pump.io process (note that SIGUSR1 is reserved by Node.js)
  2. The master process builds a queue of worker processes that need to be restarted
  3. The master process picks a random worker from the queue and sends it a signal asking it to gracefully shut down
  4. The worker process shuts down its HTTP server, which causes it to stop accepting new connections - it will do the same for the bounce server, if applicable
  5. The worker shuts down its database connection once the HTTP server is completely shut down, meaning that it's done servicing in-flight requests
  6. The worker closes its connection with the master process, and Node.js exits automatically because nothing is left keeping the event loop alive
  7. The master recognizes the death of the worker process, replaces it, waits for the new worker to signal that it's listening for connections, and repeats from step 3 until the queue is empty

This works because only one worker is shut down at a time, so the other workers keep servicing requests while that one is restarted. We wait until the new worker actually signals that it's ready to process requests before starting on the next one.
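
Triggering one is just a matter of delivering the signal from step 1 to the master process. How you find the master's pid depends on your setup; for example, with pgrep (assuming the oldest matching process is the master, since the master starts before its workers):

$ kill -USR2 "$(pgrep -o -f pump.io)"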

Such a feature requires careful error handling, so there are a lot of built-in checks to prevent administrators from shooting themselves in the foot:

  • If there's a restart already in progress, SIGUSR2 is ignored
  • If there's only 1 cluster worker, the restart request is refused (because there would be downtime and you should just restart the master)
  • The master process will load a magic number from the new code and compare it with the old magic number loaded when the master process started - if they don't match, SIGUSR2 will be refused. This number will be incremented for things that would make zero-downtime restarts cause problems, for example:
    • The logic in the master process itself changing
    • Cross-process logic changing, such that a new worker communicating with old workers would cause problems
    • Database changes
  • If a worker process doesn't shut itself down within 30 seconds, it will be killed
  • If a zero-downtime restart fails for any reason, the master process will refuse SIGUSR2 and will not respawn any more cluster workers, even if they crash - this is because something must have gone seriously wrong, either with the master, the workers, or the new code, and it's better to just restart everything. Currently this condition occurs when:
    • A new worker died directly after being spawned (e.g. from invalid JSON in pump.io.json)
    • A new worker signaled that it couldn't bind to the appropriate ports

While these checks do a lot to catch problems, they're not a silver bullet, and we strongly recommend that administrators watch their logs as they trigger restarts. However, this is still a huge win for the admin experience - the most exciting part of this for me is that it's the first step we need to take towards having fully automatic updates, which has been a dream of mine for a long while now.

Admins running from git master can start experimenting with this feature today, and it will be released during the next release cycle - i.e. with the 5.1 beta and stable, not the current 5.0 beta. Since this is highly experimental, we want this to have as much time for testing as possible. You can also check out the official documentation on this feature.

I hope people enjoy this! And as always, feel free to report any bugs.

