A Weird Imagination

Checking for unsafe shell constructs


Filenames are troublesome#

While shell programming lets you write very concise programs, it turns out that the primary use case of working with files is unfortunately much harder than it seems. This detailed article by David A. Wheeler does a good job of explaining all of the various problems that a naive shell script can run into due to the various characters that are allowed in filenames but that the shell treats specially in some way.

One surprising issue is that filenames beginning with a dash (-) can be interpreted as options due to the way globbing works in the shell. Suppose we set up a directory as follows:

$ cat > -n
Some secret text.
$ cat > test
This is a test.
It has multiple lines.

Quick, what will cat * do here?

$ cat *
     1  This is a test.
     2  It has multiple lines.

Probably not what you wanted. The reason that happens is that the * is expanded by the shell before being fed to cat, so the command executed is cat -n test, and -n gets interpreted not as a filename but as an option telling cat to number the lines of the output.

The workaround is to use ./* instead of *, so the - will never be the first character of the argument and therefore will not get misinterpreted as an option. But there are many other things that can go wrong with unexpected filenames, and remembering to handle all of them everywhere is error-prone.
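
With the directory set up as above, the ./* version reads both files as intended:

$ cat ./*
Some secret text.
This is a test.
It has multiple lines.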

Warnings for unsafe shell code#

The solution is shellcheck. shellcheck will warn you about mistakes like the cat * problem above and many other issues you may not be aware of.
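
For example, run against a script containing the cat * line above (script.sh here is a stand-in filename), shellcheck reports something like:

In script.sh line 2:
cat *
    ^-- SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.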

As I have many shell scripts around that I wrote before learning about shellcheck, I wanted to run it on all of the shell scripts (but not binaries or scripts in other languages) in my ~/bin directory, so naturally I wrote a script to do so:

#!/bin/sh

# Identify each file's type with file(1), keep only shell scripts,
# strip off file's description to recover the filename, and hand
# the resulting list to shellcheck.
find . -exec file {} \; \
    | grep -F 'shell script' \
    | sed 's/:[^:]*$//' \
    | xargs -d '\n' shellcheck

This uses the file command to identify shell scripts, strips file's description off each line to recover the filenames, and passes them to shellcheck via xargs.

Warnings in Vim#

shellcheck is written to support integration into IDEs. I use Vim to edit shell scripts, so I installed the syntastic plugin (using Vundle, which makes installing Vim plugins off GitHub very easy). Note that you need to follow the instructions on the Syntastic page, specifically the recommended settings: without any settings it won't do anything at all. Once set up, it automatically runs shellcheck on every save, highlights lines with warnings, and shows a list of warnings that can be double-clicked to jump to the location of each warning.
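
For reference, the recommended settings from the Syntastic README look something like the following (check the current README in case they have changed):

set statusline+=%#warningmsg#
set statusline+=%{SyntasticStatuslineFlag()}
set statusline+=%*

let g:syntastic_always_populate_loc_list = 1
let g:syntastic_auto_loc_list = 1
let g:syntastic_check_on_open = 1
let g:syntastic_check_on_wq = 0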

If you use the other text editor, then the shellcheck website recommends the flycheck plugin.

SSH multiplexing options


In my previous post on SSH multiplexing, I gave the following to add to your ~/.ssh/config file without explaining what it actually means:

Host *
ControlMaster auto
ControlPath ~/.ssh/connections/%r_%h_%p

The documentation for ~/.ssh/config can be found at ssh_config(5). Four options are relevant to this post:

Host
The config file is broken up into sections based on which hosts the configuration options apply to. Host * means these options apply to all connections. If you wanted the options to apply only when connecting to example.com, you could change that line to Host example.com.
Actually, you can also limit configuration options by things other than just the host using the Match directive. For example, this configuration has options for connecting with the username git, presumably due to having multiple git servers that use that username.
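For example, a minimal sketch of such a section (the IdentityFile value here is a placeholder, not taken from the linked configuration):

Match user git
IdentityFile ~/.ssh/id_git
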
ControlMaster
Tells ssh to use multiplexing. Specifically, the default is no, which means ssh will not create a master connection but will still use an already open one if available. To actually open a master connection, yes or ask can be used; the latter means a confirmation prompt (via ssh-askpass) will appear whenever a new session tries to use that master connection. The more useful options for a config file are auto and autoask, which will use an already open connection if one exists, but fall back to acting like yes and ask respectively otherwise.
ControlPath
In order to connect to the master connection, ssh needs a way to communicate with it. This is handled by the master creating a Unix socket which future ssh instances look for. Unix sockets are an IPC mechanism which allows two processes on the same machine to communicate via a connection initiated by one process creating a socket identified by a filename and another process using that special file to connect. Contrast this with TCP, where every server needs its own port number that the client has to know, and any client can connect as long as it knows the port number.
Since Unix sockets are identified by filenames, ControlPath specifies the filename to use for the socket. The %r, %h, and %p tokens mean the filename should include the remote username, hostname, and port number in order to identify which ssh session is which.
This should usually be enough, but if your home directory is shared among multiple computers, as is common in some university and other large organization setups, then you will also need %l to identify which host you are connecting from. Otherwise ssh may get confused by master connections created by a different host. Luckily, ssh provides a shortcut: the %C token, which is a hash of all four (although it is not available on older versions of ssh):
ControlPath ~/.ssh/connections/%C

or, if you are on a distro which does not have %C yet, like the latest Ubuntu LTS:

ControlPath ~/.ssh/connections/%L_%h_%p_%r

I used %L for the short version of the local hostname (for example, if %l is foo.example.com, %L would be just foo) because when I used %l, my system complained the filename was too long.

Keep in mind that the socket file is security critical, because it can be used to piggyback on your existing ssh sessions without authenticating (unless you use the ask or autoask options for ControlMaster), so make sure your ~/.ssh/connections/ directory is readable only by you:

chmod 700 ~/.ssh/connections

ControlPersist
Not used above, the ControlPersist option lets you control when the master connection actually closes. When set to no, it closes when the initial connection does. When set to yes, it stays open until explicitly closed with ssh -O exit. It can also be set to a length of time to stay open after the last connection is closed.
While the default of ControlPersist is not clearly stated in the documentation, I checked the source code to confirm it does default to no: the default value is set to 0 here if it is still set to its initial value of -1, which is the same value it is given if the configuration file says no.
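
For example, to keep the master connection open for ten minutes after the last session using it closes:

ControlPersist 10m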

Twitter via RSS


Twitter no longer offers an RSS feed. This thread offers a few workarounds, which involve external or non-free services or require creating a Twitter account. One of those external services, TwitRSS.me, is open-source with its code on GitHub. This code can be run locally to view Twitter streams in Liferea (or any other news aggregator) without relying on an external service.

Specifically, the Perl script twitter_user_to_rss.pl is the relevant part. It's intended to be used on a webserver, so the output includes HTTP headers:

Content-type: application/rss+xml
Cache-control: max-age=1800

<?xml version="1.0" encoding="UTF-8"?>
...

which can be cleaned out with tail in the script twitter_user_to_rss_file, which assumes it's in the same directory as twitter_user_to_rss.pl:

#!/bin/sh

# Run twitter_user_to_rss.pl from this script's directory for the
# given username and strip the 3 lines of HTTP headers.
"$(dirname "$0")/twitter_user_to_rss.pl" "user=$1&replies=1" \
    | tail -n +4

twitter_user_to_rss_file also handles the argument format of the script, so it just takes a single argument which is the Twitter username. The replies=1 part tells the script to use the Tweets & replies view which includes tweets that begin with @.

When creating a subscription in Liferea, the advanced options include a choice of source type. To use the script, set the source type to Command and the source to

/path/to/twitter_user_to_rss_file username

My version of twitter_user_to_rss.pl includes a few differences from the original that make it a bit more usable. Most importantly, links are made into actual links (based on this code), images are included in the feed content, and tweets are marked with their creator to make it easier to follow retweets and combinations of tweets from multiple feeds in a single stream.

Title filtering for Liferea


Liferea is a desktop news aggregator (sometimes called an RSS reader). Unlike the late Google Reader or most of its alternatives, like the open-source Tiny Tiny RSS, which are web-based and run on a server to be accessed via a web browser, Liferea is a separate desktop application and uses an embedded browser to view content.

The problem#

Sometimes you don't actually care about all of the items in a feed and the site provides no filtering mechanism. If the uninteresting items are rare enough, you can just ignore them, but a news aggregator is most useful if it only notifies you of news items you actually might want to read.

The solution#

Luckily, Liferea is very flexible. It supports running a command on a feed, which it calls a conversion filter. I wrote some Python scripts to filter feeds by title locally.

For instance, I wanted to follow only the changelog posts in the forum feed http://braceyourselfgames.com/forums/feed.php, but it includes changes to all forum topics, so I checked the Use conversion filter option and set the conversion filter to

/path/to/atom_filter_title.py --whitelist "Re: Change log"

Read more…

Setting up rTorrent


rTorrent is a text-based BitTorrent client, which makes it convenient to leave running in a screen or tmux session, so you don't have to leave a terminal window open and you can access it remotely over ssh. It also has an API for web frontends if you don't like text.

Basic setup#

You can set it up to automatically start and stop downloads based on placing .torrent files into a watch/ directory by putting the following in your ~/.rtorrent.rc:

# Default session directory. Make sure you don't run multiple instances
# of rtorrent using the same session directory. Perhaps using a
# relative path?
session = ./session

# Watch a directory for new torrents, and stop those that have been
# deleted.
schedule = watch_directory,5,5,load_start=./watch/*.torrent

Those settings also use a session directory to keep track of torrents across runs of rTorrent, which is useful if you have a lot of torrents and want to be able to restart rTorrent, say, after rebooting your computer. Note rTorrent will complain if the session directory doesn't already exist, so your first run will look like

$ screen
$ mkdir session watch
$ rtorrent

That configuration uses relative paths for watch/ and session/ so you can have multiple instances of rTorrent in different directories.

magnet: links#

In addition to .torrent files, BitTorrent also supports magnet: links as a way to join a torrent without needing a .torrent file. There is built-in support for magnet: links in rTorrent, but it requires a little extra work to make clicking one in a web browser start the download in rTorrent. Here's a script for doing so, along with instructions for having your web browser use it to handle magnet: links. I modified it to handle multiple watch/ directories:

#!/bin/bash

# Watch directory to use if none is specified or selected.
DEFAULT_WATCH='/path/to/your/watch'
if [[ $# -ge 2 ]]
then
    # Watch directory given explicitly as the second argument.
    WATCH="$2"
else
    if [[ -z "$DISPLAY" ]]
    then
        # No X session to show a dialog in; use the default.
        WATCH="$DEFAULT_WATCH"
    else
        # Ask the user to choose a watch directory.
        WATCH=$(zenity --file-selection --directory --title="Select rtorrent watch directory" --filename="$DEFAULT_WATCH")
        [[ "$(basename "$WATCH")" = watch ]] || exit;
    fi
fi
cd "$WATCH" || exit
# Extract the info hash from the magnet link.
[[ $1 =~ xt=urn:btih:([^&/]+) ]] || exit;
# Write a bencoded stub torrent ({"magnet-uri": <link>}) that
# rTorrent will load from the watch directory.
echo "d10:magnet-uri${#1}:${1}e" > "meta-${BASH_REMATCH[1]}.torrent"

This script uses bash rather than plain sh because it relies on the bash-only =~ operator for regular expression matching.

This script has a hard-coded default directory to use, but supports either specifying a different directory as the second argument or will use zenity to show a dialog asking the user to select a watch/ directory. zenity is quite useful for easily adding interactivity to shell scripts, especially for something like a directory chooser which doesn't work as well in text.
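
For instance, a minimal yes/no prompt from a shell script (the message text is just an example):

if zenity --question --text='Start the download?'
then
    echo 'Confirmed.'
fi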

SSH multiplexing


The problem#

If you are making a lot of SSH connections, starting each connection can add noticeable overhead. Even worse, a firewall might start blocking the connections, as many SSH connections from the same source look a lot like an attacker trying to guess a password, as one of my officemates discovered recently.

The solution#

SSH has a feature called multiplexing, which is described in this blog post, along with a few other useful SSH tips. Here's the relevant excerpt:

In a shell:

$ mkdir -p ~/.ssh/connections
$ chmod 700 ~/.ssh/connections

Add this to your ~/.ssh/config file:

Host *
ControlMaster auto
ControlPath ~/.ssh/connections/%r_%h_%p

The details#

While ssh is often used as just a secure version of telnet, it's actually closer to being a VPN system, supporting many channels of communication over the same encrypted link, which is how port forwarding over SSH is implemented.

Normally SSH makes a connection and opens a single channel for the terminal. Multiplexing merely means keeping that connection open for additional terminal channels. The settings described tell SSH to keep track of open connections in ~/.ssh/connections/ and automatically reuse an open connection whenever possible.
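
You can see the effect by timing a trivial command twice (example.com is a placeholder for your host); the second run reuses the master connection and skips the connection setup. ssh -O check asks whether a master is running:

$ time ssh example.com true
$ time ssh example.com true
$ ssh -O check example.com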

The firewall#

The firewall which caused this post to get written was keeping track of how many new SSH connections were made to a host and only allowing a maximum of 3 new connections each minute. As the firewall was not paying attention to whether the connections were accepted, my officemate's script, which performed multiple copies and remote commands, was getting blocked.

Logging online status


The problem#

I used to have an occasionally unreliable internet connection. I wanted logs of exactly how unreliable it was and an easy way to notice when it was back up.

The solution#

Use cron to check online status once a minute and write the result to a file. An easy way to check is to confirm that google.com will reply to a ping (this does give a false negative in the unlikely event that Google is down).

To run a script every minute, put a file in /etc/cron.d containing the line

* * * * * root /root/bin/online-check

where the root field (specific to /etc/cron.d entries) says which user to run the command as and /root/bin/online-check is the following script:

#!/bin/sh

# Check if computer is online by attempting to ping google.com.
PING_RESULT="$(ping -c 2 google.com 2>/dev/null)"
# Online means the ping succeeded and the replies did not come from a
# local 192.168.* address (e.g. a router or captive portal answering
# in place of google.com).
if [ $? -eq 0 ] && ! echo "$PING_RESULT" | grep -F '64 bytes from 192.168.' >/dev/null 2>&1
then
    ONLINE="online"
else
    ONLINE="offline"
fi
echo "$(date '+%Y-%m-%d %T%z') $ONLINE" >> /var/log/online.log
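
With the date format above, each line of /var/log/online.log looks something like this (the timestamp here is made up):

2015-06-09 12:34:56-0700 online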

The details and pretty printing#

Read more…

Child process not in ps?


A buggy program#

Consider the following (contrived) program which starts a background process to create a file and then waits while the background process is still running before checking to see if the file exists:

#!/bin/sh

# Make sure file doesn't exist.
rm -f file

# Create file in a background process.
touch file &
# While there is a touch process running...
while ps -C "touch" > /dev/null
do
    # ... wait one second for it to complete.
    sleep 1
done
# Check if file was created.
if [ -f file ]
then
    echo "Of course it worked."
else
    echo "Huh? File wasn't created."
    # Wait for background tasks to complete.
    wait
    if [ -f file ]
    then
        echo "Now it's there!"
    else
        echo "File never created."
    fi
fi

# Clean up.
rm -f file

Naturally, it will always output "Of course it worked.", right? Run it in a terminal yourself to confirm this. But I claimed this program is buggy; there's more going on.

Read more…

Out of inodes, what now?


When you start getting disk full messages on Linux, there are a few different reasons why that might happen:

  1. The expected. Too many large files. You can track down large directories using WinDirStat or

    du -hx --max-depth=1 | sort -h

    where the -x option tells du not to cross filesystem boundaries and the -h option tells both du and sort to use human-readable sizes like 11M or 1G.

  2. Deleted files aren't actually deleted if they are still open. You can use lsof to find open files. Give it the filesystem as an argument like lsof /home.

  3. By default 5% of each filesystem is reserved for writes by root. Depending on what the filesystem is being used for, this may be too much or simply unnecessary. See this Server Fault answer for how to deal with this.

  4. The files could be shadowed by a mount. If a filesystem is mounted over a non-empty directory, the files in that directory aren't visible.

  5. Last, the disk might not actually be out of space at all. It might actually be out of inodes. Some filesystems, notably the ext2/3/4 filesystems used by default on most Linux distributions, have a fixed number of inodes allocated at filesystem creation time. The default is high enough that it is unlikely to be an issue unless there are a very large number of empty files. df -i will show the number of inodes free on each filesystem to verify if a filesystem is indeed out of inodes.

    But how do you find those empty files? As described above, du will help find large files, but now we want to find large numbers of files. The following command acts like du -hx --max-depth=$depth | sort -h for inodes instead of file sizes:

    find . -xdev | sed "s@\(\([^/]*/\)\{$depth\}[^/]*\).*@\1@" | uniq -c | sort -n

    find . -xdev lists all of the files under the current directory on the same filesystem. The sed command keeps only the first $depth directories (each ending in /) and discards the rest of the filename (the .* at the end), so each directory appears once for every file or directory anywhere under it. Then uniq -c counts the repeated lines and sort -n sorts by those counts, highlighting the directories with the most files. (A small wrapper script making $depth an argument is sketched below.)
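
Since the command depends on a $depth shell variable, a tiny wrapper script (inode-usage is a hypothetical name) can make the depth an argument:

#!/bin/sh

# Usage: inode-usage [depth]   (depth defaults to 1)
depth="${1:-1}"
find . -xdev | sed "s@\(\([^/]*/\)\{$depth\}[^/]*\).*@\1@" \
    | uniq -c | sort -n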

Transferring many small files


The problem#

Transferring many small files is much slower than you would expect given their total size.

The solution#

tar c directory | pv -abrt | ssh target 'cd destination; tar x'

or

cd destination; ssh source tar c directory | pv -abrt | tar x
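
If the network link is slow, compressing the stream with tar's z (gzip) option can speed things up further at the cost of some CPU time; for example, a variant of the first command:

tar cz directory | pv -abrt | ssh target 'cd destination; tar xz'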

The details#

Read more…
