strugee.net

Posts categorized as "sysadmin"

filter-other-days is portable to FreeBSD

I'm pleased to announce filter-other-days 1.0.1. This is a bugfix release primarily improving portability to other Unix-like operating systems; in particular, the test suite now fully passes under FreeBSD. Specifically:

  • Various portability bugs in the test suite itself were fixed - the test suite no longer relies on GNU date (specifically GNU date -d semantics) or on a fully-functional /dev/fd (it falls back to named pipes), and it no longer hardcodes bash's install path as /bin/bash
  • Some non-portable uses of echo "\n", which break on BSD systems, were replaced with printf invocations
  • Travis CI now checks filter-other-days with Debian's checkbashisms script, which is run in strict mode
  • Non-portable uses of test's -o option were caught by checkbashisms and replaced with || (this and the echo fix are sketched just after this list)
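
For illustration, here's roughly what two of those fixes look like (these are sketches, not verbatim diffs from the repository; $file is just a placeholder variable):

# echo's handling of backslash escapes like \n varies between shells,
# which is what breaks on BSD /bin/sh; printf behaves the same everywhere:
echo "\n"         # non-portable
printf '\n'       # portable

# test's -o operator is marked obsolescent by POSIX; chain two tests instead:
[ -f "$file" -o -p "$file" ]        # non-portable
[ -f "$file" ] || [ -p "$file" ]    # portable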

With these changes I expect that filter-other-days will probably run on all major BSDs. I intend to confirm this hypothesis soon and have filed bugs for OpenBSD and NetBSD, plus illumos just for kicks.

As with 1.0.0, you can clone filter-other-days from GitHub or you can download a (signed) tarball. Please do report any bugs you find in the release.

Enjoy!


filter-other-days: Artificial Ignorance-compatible logfile date filtering

I've just published version 1.0 of my latest project, filter-other-days - a shell script to filter logfiles for today's date in an Artificial Ignorance-compatible way.

If you haven't heard of Artificial Ignorance, it's something you should look into because it's pretty awesome. Here's the tl;dr: it doesn't make sense to look for all the "interesting" things in logfiles, because it's not actually possible to enumerate all the failure conditions of a system. So instead we throw away entries that we're sure are just routine. Since we've gotten rid of all the uninteresting entries, whatever is left has to be interesting.
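
In shell terms the whole idea boils down to a single inverted grep (ignore.patterns here is a hypothetical, hand-maintained file of regexes for log lines you've decided are routine):

# Everything that survives the filter is, by definition, interesting.
grep -v -E -f ignore.patterns /var/log/messages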

I find this pretty compelling, and decided to start implementing it on my Tor relay. I quickly realized that my ideal workflow would be to have cron email me a daily report of interesting log entries. However, this presented a problem: how do I get just today's log entries? I wanted to handle all logfiles at once instead of receiving different reports for different logs, so I had to be able to parse every logfile the same way. My relay runs on FreeBSD, so the logs are unstructured text files, and even worse, several daemons (like Tor itself) write timestamps in a different format. That makes parsing all logfiles at once genuinely difficult: I couldn't just grep for today's date, because that would drop legitimate entries from logfiles that format their timestamps differently.
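
Concretely, the workflow I wanted looks something like this hypothetical crontab entry (the paths and the pattern file are placeholders, not my actual setup):

# Every morning: keep only today's entries, drop the routine ones, mail whatever is left.
0 6 * * * cat /var/log/messages /var/log/maillog | filter-other-days | grep -v -E -f /usr/local/etc/ignore.patterns | mail -s 'daily log report' root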

I briefly considered trying to write a regex to match all sorts of different timestamp formats, but quickly rejected this idea as too fragile. There are a lot of moving parts in a modern operating system - what if, say, a daemon changed its default timestamp format? Or, more likely, what if I simply missed a particular format present in my logs? Then I'd be accidentally throwing away an entire logfile. To solve this problem, I decided to apply the same idea behind Artificial Ignorance: if I couldn't reliably match every log entry from today's date, I could do the next best thing and discard all entries from other dates. That way the worst that could happen is me receiving irrelevant information, and I'd be basically guaranteed never to miss a legitimate entry from today.
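
To make the inversion concrete, here's a toy version of the idea (just an illustration of the approach, not how filter-other-days is actually implemented): instead of matching today, generate a pattern for every other day of the month and throw those lines away.

today=$(date +%e)                     # day of month, space-padded
day=1
while [ "$day" -le 31 ]; do
    d=$(printf '%2d' "$day")
    # emit a syslog-style pattern for every day that isn't today
    [ "$d" = "$today" ] || printf '^[A-Z][a-z][a-z] %s \n' "$d"
    day=$((day + 1))
done > /tmp/other-days.patterns
grep -v -f /tmp/other-days.patterns /var/log/messages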

filter-other-days is a working implementation of this design. Originally it lived with the other random scripts I keep alongside my dotfiles, but it quickly became obvious that it was useful as a standalone project. So I extracted it into its own repository, which now lives on GitHub. From there I continued to improve the script while adding a test suite and writing extensive documentation (including a Unix manpage - I always feel like a wizardly hacker when writing those things). This took, by my estimation, somewhere between 10 and 15 hours - partly because this is a shockingly non-trivial problem, but mostly because regexes are hard.

But today I finally finished! So I'm super excited to announce that version 1.0 of filter-other-days is now available. You can either clone it from GitHub or download a tarball (and the accompanying signature, if you want). It works pretty well already, but I have some ideas for future directions the project could take:

  1. Logic allowing you to specify the date you want to filter for, instead of assuming it's today (though you can already get this behavior using faketime; that's what the test suite does - see the sketch after this list)
  2. Removal of the dependency on GNU seq - this is, to my knowledge, the only non-POSIX requirement of filter-other-days
  3. Debian package, maybe?
  4. More log formats (please report bugs if you have formats filter-other-days doesn't recognize - which you probably do!)
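
To sketch the first two items (the exact invocations are my guesses, not documented interfaces): faketime lets you pin the "current" date, and GNU seq is replaceable with a plain while loop.

# Pretend it's some other day, then filter as usual (assuming the script
# reads the log on stdin):
faketime '2017-06-01 12:00:00' ./filter-other-days < /var/log/messages

# A POSIX-only stand-in for 'seq 1 31':
i=1
while [ "$i" -le 31 ]; do
    echo "$i"
    i=$((i + 1))
done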

If you find this project useful, let me know! I'd love to hear about how people are using it. Or if it breaks (or doesn't fit your use cases), please report bugs or send patches - I love those, too! Either way, may the logs be with you!


Getting on board with configuration management

For a long while I really disliked configuration management. This mostly stemmed from my experience managing Apache via Puppet, which I found indirect and unnecessary - basically, the only reason I was doing it was to get version control. In fact, I even started a project called bindslash, which I literally described as "not configuration management".

However, last Thursday, steevie (my primary server) crashed again. So I went into a fallback DigitalOcean VM I'd set up the last time this had happened and updated stuff. I presented my LibrePlanet slides from that VM. And eventually I bit the bullet and set up a secondary email server which, to my great surprise, has not received a flood of spam yet (though I'm sure it will at some point).

The whole ordeal really made me understand the benefit of configuration management. I would've spent less time and been less stressed if I could just plug in a config management system to get a useful failover system. So as of today, I'm on board with configuration management, and bindslash is dead.

I still kinda hate Puppet, so I think I'll try out Ansible and maybe Chef. Ansible's agentless model in particular probably makes a lot of sense for my needs. It also makes me sad to kill bindslash, since I still think it would be a useful project and there's definitely a place for it in the world. But I no longer have any reason to work on it, so I'm just going to stop pretending I'll ever finish it. If anyone is interested in that approach, talk to me and I'll happily give you the name, the repo, my thoughts on its design, etc.

Anyway. Now to set up outbound mail on the failover VM.

*big sigh*


Revisiting my Tor relay

(Okay, so I miserably failed my blog-every-day thing. Shut up. Maybe next time I'll try every week or something... anyway.)

A couple of days ago I logged into the Tor relay I run to show someone the ARM graphs. The relay had a fair amount of traffic, so the graphs were fairly impressive, but I'm also in the habit of running apt-get update; apt-get upgrade every time I log into a server, so I did that too. To my surprise, I got a message telling me that there was a dependency problem with my kernel! So like the great sysadmin I am, I looked at such a fundamental system problem, shrugged my shoulders, and said, "oh, I should probably fix that". And then logged out.

Well, I did end up fixing it today. And boy, was it an adventure. My first step was to ignore the APT problems and edit my torrc to reflect a) the fact that I'm not eligible for the AWS Free Tier anymore (so I needed to throttle bandwidth), b) my new email, and c) my new GPG key. With that done, I knew I could easily have the system fix the dependency problems with a simple apt-get install -f. Easy!
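
For context, those edits amount to a handful of torrc lines along these lines (the values and contact string are placeholders, not my real configuration):

# Throttle the relay now that the bandwidth isn't free anymore (placeholder values)
RelayBandwidthRate 1 MBytes
RelayBandwidthBurst 2 MBytes
# Updated contact info and GPG key (also placeholders)
ContactInfo Example Operator <tor-relay@example.com> - GPG: 0x0123456789ABCDEF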

Well, no. The apt-get run tried to install some Linux kernel headers, which seemed all well and good, until I got this:

Unpacking linux-headers-3.2.0-90 (from .../linux-headers-3.2.0-90_3.2.0-90.128_all.deb) ...
dpkg: error processing /var/cache/apt/archives/linux-headers-3.2.0-90_3.2.0-90.128_all.deb (--unpack):
unable to create `/usr/src/linux-headers-3.2.0-90/arch/arm/plat-pxa/include/plat/dma.h.dpkg-new' (while processing `./usr/src/linux-headers-3.2.0-90/arch/arm/plat-pxa/include/plat/dma.h'): No space left on device
No apport report written because the error message indicates a disk full error
dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)

Um, what? How am I out of free space? Okay, whatever. I knew there were probably a lot of packages cached in /var/cache/apt/, including old, vulnerable packages that had been replaced by the unattended-upgrades system. I did an ls and found only about five .deb files - something must have been automatically cleaning that directory. I was getting a little worried now, but I nuked the files anyway and reran apt-get install -f. Same thing. Well, okay, maybe I hadn't gotten rid of enough stuff? How much did I need?

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      4.0G  2.2G  1.6G  59% /

At this point I'm in full-on "something-is-seriously-wrong-and-I-need-to-recover" mode. How was it possible that I had only used 59% of the filesystem, but dpkg was saying my disk was full? A little internet searching later, I found the culprit:

$ df -i
Filesystem     Inodes  IUsed IFree IUse% Mounted on
/dev/xvda1     262144 257479  4665   99% /
udev            74758    377 74381    1% /dev
tmpfs           76179    259 75920    1% /run
none            76179      3 76176    1% /run/lock
none            76179      1 76178    1% /run/shm

I hadn't run out of disk space. But I had run out of inodes. (Isn't this supposed to happen to other people?)

I tried removing some stuff via APT, but that refused to do anything due to the dependency problems. My next thought was that there were probably a bunch of old processes running that were essentially holding a bunch of inodes hostage. I couldn't install debian-goodies, so I couldn't use checkrestart, but I improvised by looping over every running service and restarting it.
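
From memory, the improvised loop was something along these lines (a reconstruction, not a copy-paste; Ubuntu 12.04's service wrapper covers both Upstart jobs and init.d scripts):

# Restart everything that 'service --status-all' reports as running.
for svc in $(service --status-all 2>&1 | grep '^ \[ + \]' | awk '{print $4}'); do
    service "$svc" restart
done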

Still nothing.

I'm not proud of what I did next. But I was backed into a corner, so I did something only dpkg is supposed to do. I ran rm -r on a couple directories in /usr/src. And boy, it was like magic. Suddenly apt-get install -f worked like a charm. It started to upgrade a couple packages, rebuilding some GRUB configuration files... and then came to a screeching halt.

Setting up linux-headers-3.2.0-90-virtual (3.2.0-90.128) ...
dpkg: dependency problems prevent configuration of linux-headers-virtual:
linux-headers-virtual depends on linux-headers-3.2.0-68-virtual; however:
Package linux-headers-3.2.0-68-virtual is not installed.
dpkg: error processing linux-headers-virtual (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
dpkg: dependency problems prevent configuration of linux-virtual:
linux-virtual depends on linux-headers-virtual (= 3.2.0.68.81); however:
Package linux-headers-virtual is not configured yet.
dpkg: error processing linux-virtual (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Errors were encountered while processing:
linux-headers-virtual
linux-virtual
E: Sub-process /usr/bin/dpkg returned an error code (1)

Are you kidding?? More errors?

Turns out that APT is essentially the only thing on this system that makes large changes to the filesystem. So the probability that APT would be the program to trigger the inode limit was pretty high. It started an upgrade run, then got interrupted in the middle by the "no space left on device" error, leaving the dependency tree in a state that we in the tech community call "100% totally screwed". (This is the technical term.)

I'll spare you the gory details, but I ended up chasing down packages in the Ubuntu archive, running ubuntu-support-status because I wondered whether the packages I was looking for had been dropped from the archive as unsupported, using aptitude instead of apt-get (because aptitude's dependency resolver tends to be better), and so on. The solution finally turned out to be running dpkg --install on the exact right .debs in the exact right order, which satisfied APT's dependency woes, let apt-get install -f fix the configuration problems, and allowed the hundreds of packages that had been waiting for an upgrade to finally install. Whew!
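
Schematically, the fix looked something like this (the filenames below are placeholders standing in for the actual packages dpkg was complaining about):

# Install the missing dependency first, then the package that needed it,
# then let APT repair the rest. (Filenames are hypothetical.)
dpkg --install linux-headers-3.2.0-68-virtual_VERSION_amd64.deb
dpkg --install linux-headers-virtual_VERSION_amd64.deb
apt-get install -f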

Anyway, I need to upgrade the version of Ubuntu the system is on (currently it's 12.04.5 LTS), because Tor is out of date (among other reasons). However, since that will involve taking the system down for a reboot, I wanted to memorialize the following:

$ uptime
00:01:47 up 392 days, 17:15,  1 user,  load average: 0.05, 0.04, 0.05

Holy moly. This system is bordering on 400 days of uptime. That's over a year of continuous run time! Astonishing.

Wish me luck with this upgrade...

tl;dr: inode limits are killer.

