Advanced Namespace Tools blog 18 December 2016

Debugging rcpu out from ANTS service namespace

The Problem

While working on getting my newest node connected to the rest of my grid, I hit a snag. While working in the independent "service namespace" that ANTS creates under the usual userspace, I received errors trying to rcpu out to another node:

Write error
Write error
echo: write error: inappropriate use of fd
bind: /mnt/cpunote: '/mnt/cpunote' not found

I tried to rerootwin into the usual userspace, but received the same errors, minus the final line. If I am within the default user namespace (not entered via rerootwin) then rcpu out to other hosts works correctly. rcpu IN to the service namespace also works correctly. So, there is a bug/incompatiblity in the current service namespace relative to what the new rcpu expects and needs. Finding and fixing this is the current task.

Debugging process

Discovery: rimport works, but rcpu does not. Because rimport and rcpu share much behavior (both use rconnect to connect to exportfs on the remote host) this eliminates many possible sources of error.

Discovery: when using stub to block off access to /net, the sequence of errors is mostly unchanged. I wanted to make sure the errors I was seeing were originating locally, not on the remote side.

Discovery: the ants service namespace does not place a mntgen on /mnt, using instead a premade list of necessary directories. This is the cause of the /mnt/cpunote error. However, adding the mntgen solves only that error, and makes the behavior in the service namespace the same as the behavior in the rerootwin namespace, which still fails with the same output.

Discovery: the initial line of the output appears to be from the pwd command, it changes to match whatever directory the rcpu command was executed from. In a working use of rcpu, that pwd command isn't printed. I believe it is sent to the remote host in the following line, the only place pwd appears and the final line of the rcpu script which 'does the business' after the functions and vars have been defined:

exec $connect $host <{dir=`{pwd} pvar dir cmd; builtin whatis server; echo server} client <[10=0] >[11=1] >[12=2]

Discovery: I verified that the old cpu command works as expected from the service or rerootwin namespace.

Discovery: With some debugging statements echoed to a log file, I can see that nothing is being run on the remote side at all, the server function isn't being executed at all there.

Discovery: Even after an auth/newns from the rerootwin namespace to establish a standard namespace, the same errors appear. (Side note: how does auth/newns work with drawterm without using the savedevs/mydevs trick in the manner of rerootwin? I would expect to lose access to the i/o from /mnt/term devices.)

Discovery: rcpu works after rerootwin -f ramboot when drawterm in to the standard namespace. The combination of this fact with the last fact is making me believe that there must be a different issue involved than something just being missing/in the wrong place in the namespace.

Discovery: the actual root of the issue seems to have to do with the /fd directory, but it is still unclear to me just what is going on. I have discovered that shells within the ants service namespace have 3 more fd files visible in /fd. The rcpu script makes some assumptions about what is going on in /fd. If I increase the fd numbers in the exec $connect command shown above, I no longer see any errors - but I still don't get a working connection, things just fail to work with no error messages displayed. The modified script fails silently the same way when run from the standard namespace.

Discovery: played with the fd numbers in the redirects some more, seems like all the seemingly sensible things I tried (offsetting either/both the left and right side by 3) produced the same kind of silent failure in both standard and nonstand fd environments.

Discovery: here is what cat /proc/$pid/fd looks like in a working environment:

0 r  M 1065 (0000000000000001 0 00)  8192       84 /dev/cons
1 w  M 1065 (0000000000000001 0 00)  8192      650 /dev/cons
2 w  M 1065 (0000000000000001 0 00)  8192      713 /dev/cons
3 r  M 1042 (000000000000890c 1 00)  8192      719 /rc/lib/rcmain
4 rw |    0 (0000000000011241 0 00) 65536      307 #|/data
5 r  M 1065 (0000000000000001 0 00)  8192       84 /dev/cons

And here is what it looks like from an environment where rcpu is failing:

0 r  M 1066 (0000000000000001 0 00)  8192      154 /dev/cons
1 w  M 1066 (0000000000000001 0 00)  8192     1010 /dev/cons
2 w  M 1066 (0000000000000001 0 00)  8192     1073 /dev/cons
3 r  c    0 (0000000000000002 0 00)     0        0 /dev/cons
4 w  c    0 (0000000000000002 0 00)     0       22 /dev/cons
5 w  c    0 (0000000000000002 0 00)     0        5 /dev/cons
6 r  M 1016 (000000000000890c 1 00)  8192      719 /rc/lib/rcmain
7 rw |    0 (000000000000f501 0 00) 65536      307 #|/data
8 r  M 1066 (0000000000000001 0 00)  8192      154 /dev/cons

Something looks quite wrong with fd 3, 4, 5 in that list. The manpage for the dup device and for the proc device don't seem to explain exactly what the 'c' means vs the 'M' - it seems to be "device type" but what is an M device vs a c device? What is creating these 3 'extra' fds which have an iounit of 0? And I still don't get how all this is messing up rcpu.

The Solution

What I had missed in my attempt to fix things by increasing the fd redirection numbers in the exec $connect command was that those file descriptor redirection numbers also appear in the server side fn. With those numbers changed to match, things worked correctly. The reason I did not think to look at the server side fn was that earlier in debugging, it was not being run on the server at all. With the change to the redirects to use higher fd numbers, it was being run, but without the input/outputs matching up correctly, it did nothing.

To fix this in ANTS means I need to add the slightly modified version of the rcpu script with higher numbers to my collection of patched files. I might even ask if the 9front maintainers want to increase these numbers in the standard distribution, because perhaps there are other ways that users might end up with "extra" fds being used in their shell and would run into the same conflict. However, its probably just me and my weird stuff that is affected, so I'm not really expecting such a change. [Later edit - there is more to this story, the next blog post describes the rest of the journey.]

Lessons Learned

This debugging process used up many more hours of work than it probably should have, and I only found the answer with help from cinap lenrek. (Thanks cinap!) I had actually put most of the pieces together, but I was failing to check my own server-side debug logs, which would have shown that after the first change of redirection numbers in the connect command, I was now actually reaching the server, just failing to establish the connection correctly after that.

Even though my first, half-fix change caused the error messages to go away, I failed to realize that the failure mode might be different, and I needed to re-examine the knowledge I had gained from earlier in debugging. I made the incorrect assumption that even without the error messages, the server side fn still wasn't executing.

The main lesson seems to be: when debugging, once you change one aspect/symptom of the problem, you need to recheck all the data you had gathered previously, because things may have changed and the exact mode of failure could be different. This is obvious when stated like this, but in practice, it is easy to forget about exactly what things you are still assuming to be true.