A buggy program
Consider the following (contrived) program[1] which starts a background process to create a file, and then waits while the background process is still running before checking to see if the file exists:
```sh
#!/bin/sh
# Make sure file doesn't exist.
rm -f file
# Create file in a background process.
touch file &
# While there is a touch process running...
while ps -C "touch" > /dev/null
do
    # ... wait one second for it to complete.
    sleep 1
done
# Check if file was created.
if [ -f file ]
then
    echo "Of course it worked."
else
    echo "Huh? File wasn't created."
    # Wait for background tasks to complete.
    wait
    if [ -f file ]
    then
        echo "Now it's there!"
    else
        echo "File never created."
    fi
fi
# Clean up.
rm -f file
```
Naturally, it will always output "Of course it worked.", right? Run it in a terminal yourself to confirm this. But I claimed this program is buggy; there's more going on.
Breaking the program
Now put your system under load. The `yes` command is great for using up CPU. Running a few instances of it should be enough to keep your processor(s) busy. Try that script a few more times, or use

```sh
while true; do ./test.sh; done | uniq
```

to run it over and over again. Add `-c` to the `uniq` if you want to see how often it fails. You should see something like
```
$ while true; do ./test.sh; done | uniq -c
      9 Of course it worked.
      1 Huh? File wasn't created.
      1 Now it's there!
     17 Of course it worked.
      1 Huh? File wasn't created.
      1 Now it's there!
    103 Of course it worked.
      1 Huh? File wasn't created.
      1 Now it's there!
```
What happened? The process always completes eventually, so why does the initial check not find it sometimes?
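For reference, `ps -C` selects processes by command name, so it only reports a `touch` process while that command is actually executing. A quick sanity check of that behavior (a sketch, assuming the procps `ps` found on most Linux systems; `sleep` stands in for `touch` so there's time to observe it):

```shell
sleep 2 &                      # a background process we can look for
ps -C sleep > /dev/null && echo "sleep is running"
wait                           # let it finish
ps -C sleep > /dev/null || echo "sleep is gone"
```

The exit status of `ps -C` tells us whether any matching process was found, which is exactly what the script's `while` condition relies on.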
Fixing the program
If we replace the `while` condition with

```sh
while jobs -r | grep -qF "touch"
```

and change the first line of the script to `#!/bin/bash`, then the file will always get created before we check for it. But if `jobs` clearly knows about the process, why doesn't `ps`?
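Putting those two changes together, the core of the fixed script would look something like this (a sketch; the error branch is trimmed since it should no longer be reachable):

```shell
#!/bin/bash
# Make sure file doesn't exist.
rm -f file
# Create file in a background process.
touch file &
# While the shell's own job table still shows a running touch job...
while jobs -r | grep -qF "touch"
do
    # ... wait one second for it to complete.
    sleep 1
done
# The job table is the shell's own record of its children,
# so the file must exist by the time the loop exits.
if [ -f file ]
then
    echo "Of course it worked."
else
    echo "Huh? File wasn't created."
fi
# Clean up.
rm -f file
```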
The answer is that we're asking the wrong question. Let's print out some debug information just after the background task is started to see what is going on:
```sh
#!/bin/bash
touch file &
cat /proc/$(jobs -rp)/cmdline 2>/dev/null | tr '\0' ' '
echo
```
What this new script is doing is starting the background task, then using `jobs -rp` to get the PID of the task[2] and reading the `/proc` filesystem to get the command line of that process according to the OS[3]. The `tr` and `echo` are there just to make the output more human-readable, because `/proc/$PID/cmdline` has null characters between the arguments and has no newline at the end.
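You can see that raw format for yourself by dumping the current shell's own entry; `$$` here is the shell's PID (assuming a Linux-style `/proc`):

```shell
# cmdline holds the arguments separated by null bytes with no
# trailing newline, so translate the nulls to spaces and add one.
tr '\0' ' ' < /proc/$$/cmdline
echo
```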
Once again, run it in a loop with `uniq` because most of the results will be boring:
```
$ while true; do ./test.sh; done | uniq -c
     77 
      1 touch file 
     59 
      1 /bin/bash ./test.sh 
```
Most of the lines are blank, probably due to the subprocess having already completed. Once we get the command line of the subprocess, `touch file`, as we would expect if it hasn't finished yet. And once we get `/bin/bash ./test.sh`. We didn't run any job with that command line. How did that happen? At least that explains why `ps` sometimes didn't find the subprocess by the name `touch`.
What happens when we create a subprocess?
Creating a subprocess is actually a two-step operation called fork-exec, after the two system calls involved: `fork()`, which copies a process, returning the child's PID in the parent (and 0 in the child), and `exec()`, which executes a program in the current process.

There are three different outputs because there are three different ways for the child and parent processes to interleave:
- `cat` runs after `touch` completes. The output is empty because the child process has already exited.
- `cat` runs after `exec()`, during the execution of `touch`. The output is `touch file`.
- `cat` runs before `exec()`. The output is `/bin/bash ./test.sh` because `exec()` is what changes the value of `cmdline`, not `fork()`. At that point in time, the child still has the parent's `cmdline`, which is the command line used to run the script.

There is no fourth option of `cat` running before `fork()`, because the `cat` occurs after `fork()` returns in the parent process.
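We can watch `exec()` swap out the `cmdline` from the shell itself. In this sketch, the `&` forks a subshell and the `exec` builtin then replaces it with `sleep`; the half-second pause simply gives the child time (unreliably, as this whole article shows) to get past `exec()`:

```shell
#!/bin/bash
( exec sleep 2 ) &      # fork() a subshell, then exec() sleep in it
child=$!
sleep 0.5               # give the child a chance to reach exec()
# By now the child's cmdline should be sleep's, not the shell's.
tr '\0' ' ' < /proc/$child/cmdline
echo
wait
```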
The Linux debugging utility `strace` can be used to watch those `fork()` and `exec()` syscalls (specifically their `clone()` and `execve()` variants on my system, according to the man pages). The `-f` option will include the child processes:

```sh
strace -f -e clone,execve ./test.sh
```
Frustratingly, it turns out that, like many race conditions, this is a heisenbug: I was unable to record a trace of a failing run under `strace`. This isn't terribly surprising, as the failure mode requires a very specific interleaving.
While this may have been an informative foray into how subprocesses work on Linux and into debugging methods, the actual takeaway is that `ps` shouldn't be used to manage subprocesses. The shell's job control support is the proper way to do so.
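For example, since `$!` holds the PID of the most recent background job, the entire polling loop can be replaced by a single `wait` (a sketch that works in plain `/bin/sh`):

```shell
#!/bin/sh
rm -f file
touch file &
pid=$!           # PID of the background job, straight from the shell
wait "$pid"      # block until that specific job has finished
# No race: wait returns only after touch has exited.
if [ -f file ]
then
    echo "Of course it worked."
fi
rm -f file
```

Unlike grepping `ps` output, `wait` asks the shell directly about its own child, so there is nothing to misidentify and no window in which the answer can be wrong.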
[1] This program is based on an actual bug one of my officemates had, although it has, of course, been simplified to the point of absurdity. Notably, the original `while` loop implemented a timeout mechanism, which has been omitted for simplicity. Furthermore, this StackOverflow answer gives a cleaner timeout implementation for background tasks in a shell script. ↩