Advanced Namespace Tools blog 19 December 2016

Weirdness relating to fd conflicts from previous post

A Demonstration

From an environment without 'extra' fds:

cpu% echo foo <{derp=`{pwd}; echo $derp} <[10=0] >[11=1] >[12=2]
foo /fd/8
cpu% 

From an environment with 'extra' fds:

helix: echo foo <{derp=`{pwd}; echo $derp} <[10=0] >[11=1] >[12=2]
/usr/glenda
foo /fd/9
echo: write error: inappropriate use of fd
helix: 

So, this seems to demonstrate that subshells and i/o dup/redirection can produce errors from overlapping file descriptors.

Further Debugging

Exactly what was going on at the fundamental system level was still not perfectly clear, so cinap suggested that I run the test cases in ratrace (a Plan 9 syscall tracer). This way, we could see exactly how the kernel was handling the file descriptors and redirections. This produces a huge pile of output even for a tiny shell script like these text cases. The key syscall to zoom in on is 'Dup':

cpu% grep Dup ratsuccesslog
48970 fderrscript Dup f003 0 10 = 10 "" 1482148724057707024 1482148724057708228
48970 fderrscript Dup f003 1 11
48972 fderrscript Dup f003 0 10 = 10 "" 1482148724788513604 1482148724788514820
48972 fderrscript Dup f003 1 11
48970 fderrscript Dup f003 2 12
48972 fderrscript Dup f003 2 12 = 12 "" 1482148725320227147 1482148725320228576
48970 fderrscript Dup f003 7 1 = 1 "" 1482148725719318273 1482148725719320043
48970 fderrscript Dup f003 8 1
48968 fderrscript Dup f003 0 10
48968 fderrscript Dup f003 1 11 = 11 "" 1482148733014151341 1482148733014152856
48968 fderrscript Dup f003 2 12 = 12 "" 1482148733147170765 1482148733147171950
48968 fderrscript Dup f003 7 1 = 1 "" 1482148733281734083 1482148733281735711
cpu% grep Dup ratfaillog
2195 fderrscript Dup 20f43e 0 10 = 10 "" 1482148301362724378 1482148301362725836
2195 fderrscript Dup 20f43e 1 11
2197 fderrscript Dup 20f43e 0 10
2195 fderrscript Dup 20f43e 2 12 = 12 "" 1482148302348441498 1482148302348442616
2195 fderrscript Dup 20f43e 10 1 = 1 "" 1482148302490402844 1482148302490404387
2197 fderrscript Dup 20f43e 1 11 = 11 "" 1482148303329831495 1482148303329832422
2195 fderrscript Dup 20f43e 11 1
2197 fderrscript Dup 20f43e 2 12 = 12 "" 1482148303545435122 1482148303545435982
2193 fderrscript Dup 20f43e 0 10
2193 fderrscript Dup 20f43e 1 11
2193 fderrscript Dup 20f43e 2 12 = 12 "" 1482148310596248603 1482148310596249715
2193 fderrscript Dup 20f43e 10 1

In the failure case, the conflict is on fd 10, which is initially Dup'd to a copy of stdin (fd 0) but then subsequently Dup'd on to stdout (fd 1). In the working example, fd 7 is what gets Dup'd in the latter case. Cinap spent some time analyzing and realized that it is the <{} indirection which is causing the Dup in question. He decided this possibility of fd conflicts represented an actual bug in the rcpu script, so rcpu was changed to avoid doing so, by storing the needed commands in an environment variable:

exec $connect $host /env/rscript client <[10=0] >[11=1] >[12=2]

instead of the earlier version:

exec $connect $host <{dir=`{pwd} pvar dir cmd; builtin whatis server; echo server} client <[10=0] >[11=1] >[12=2]

More Lessons Learned

The previous blog post was written after I had tweaked my local version of the rcpu script to work, and I had been content to say 'ok, works now, problem solved', but Cinap correctly pointed out that it wasn't enough to just fix the symptom, it was important to gather enough data to really understand what was going on and why. A tool like ratrace can generate a seemingly overwhelming amount of intimidating looking output (the greps above capture only a tiny portion of the full trace) but there should always be a way to take small, careful steps forward to build your understanding.

The final result of all the debugging was very positive: I gained some new understanding of how i/o redirection works in Plan 9's rc shell, how to troubleshoot with a powerful tool like ratrace, and the 9front distribution got a small patch to fix a corner-case failure.