Advanced Namespace Tools blog 5 January 2017

Weird Behavior of CFS ctl file and 9p requests for it

Background and Initial Symptoms

The Cache File Server has an option to collect statistics, which are presented via a file /cfsctl at the root of the tree. While trying to read these stats, I experienced buggy behavior:

I took a look at the code, and added some debugging prints to investigate what was going on. Here is an example of debug output from adding diagnostic prints to the cfsctl section of the genstats and rread functions in cfs.c. Each cycle beginngin with a "statlen" is from successive cats of the file after the data has changed:

statlen 1054 = p 4df8e - statbuf 4db70
cnt 8192 > statlen 1054 - off 0 setting c.rhdr.count to 1054
cnt 8192 > statlen 1054 - off 1054 setting c.rhdr.count to 0
statlen 1055 = p 4df8f - statbuf 4db70
cnt 7138 > statlen 1055 - off 1054 setting c.rhdr.count to 1
cnt 8192 > statlen 1055 - off 1055 setting c.rhdr.count to 0
statlen 1057 = p 4df91 - statbuf 4db70
cnt 7137 > statlen 1057 - off 1055 setting c.rhdr.count to 2
cnt 8192 > statlen 1057 - off 1057 setting c.rhdr.count to 0
statlen 1281 = p 4e071 - statbuf 4db70
cnt 7135 > statlen 1281 - off 1057 setting c.rhdr.count to 224
cnt 8192 > statlen 1281 - off 1281 setting c.rhdr.count to 0
statlen 1275 = p 4e06b - statbuf 4db70
cnt 6911 > statlen 1275 - off 1281 setting c.rhdr.count to -6
(int)c.rhdr.count -6 < 0, sending eof

Contrary to my expectations, statlen seemed to be behaving correctly, but the count and offset taken from the 9p request structure c.thdr (c.rhdr is the reply) looked strange to me. Why was the request corresponding to each new cat operation showing a steadily shrinking count and increasing offset?

A Partial Fix with very Strange Results

The failure to show the data correctly was mostly due to the offset parameter, because the data is sent in the reply by:

c.rhdr.data = statbuf + off;

Because the offset doesn't return to 0 on new reads, only the final few bytes of the changed buffer are shown - with the stale data appearing before them. I decided to hack in some manipulation of the offset, and then things got really weird. Here is my hacked-up debugging version of the code for the cfsctl file in rread:

off = c.thdr.offset;
cnt = c.thdr.count;
if(statson && ctltest(mf)){
/* statsend is a variable I added to help control the output behavior */
	fprint(2, "rread cfsctl:\n");
	if(statsend == 0)
		off = 0;
	if(statsend == 1)
		off = statlen;
	statsend++;
	if(statsend == 2)
		statsend = 0;
/* The idea is that we send all the data from 0 to statlen on the first request, then, then set the offset equal to the amount read previously and send no data on the second read - the rest of the logic is unchanged save debugging prints */
	if(cnt > statlen-off){
		c.rhdr.count = statlen-off;
		fprint(2, "cnt %d > (statlen %d - off %lld) setting c.rhdr.count to %d\n", cnt, statlen, off, c.rhdr.count);
	}
	else{
		c.rhdr.count = cnt;
		fprint(2, "cnt %d <= statlen %d - off %lld, c.rhdr.count set to cnt %d\n", cnt, statlen, off, c.rhdr.count);
	}
	if((int)c.rhdr.count < 0){
		fprint(2, "(int)c.rhdr.count %d < 0, sendreply(eof)\n", (int)c.rhdr.count);
		sendreply("eof");
		return;
	}
	c.rhdr.data = statbuf + off;
	fprint(2, "c.rhdr.data %p set from statbuf %p + off %lld\n", c.rhdr.data, statbuf, off);
	sendreply(0);
	return;
}

This did fix the issue partially - I now received correctly updated stats data from every read of the file, with no corruption and no eof errors. However, something even stranger (to me at least!) started happening:

All of the previous reads from the file were also printed, with the new data appended at the end

So, as the fs was used and I read from the ctl file repeatedly, the output would be like this:

        Client                          Server
   #calls     Δ  ms/call    Δ      #calls     Δ  ms/call    Δ
      1       1   0.750   0.750       1       1   0.743   0.743 Tversion
      7       7   0.575   0.575       7       7   0.569   0.569 Tauth
      7       7   0.893   0.893       7       7   0.888   0.888 Tattach
    325     325   0.490   0.490     324     324   0.486   0.486 Twalk
    147     147   0.474   0.474     146     146   0.470   0.470 Topen
    764     764   0.239   0.239      18      18   3.644   3.644 Tread
     16      16   5.071   5.071      16      16   5.065   5.065 Twrite
    135     135   0.586   0.586     135     135   0.581   0.581 Tclunk
    169     169   0.468   0.468     169     169   0.455   0.455 Tstat
     11      11 ndirread
      7       7 ndelegateread
      0       0 ninsert
      0       0 ndelete
      5       5 nupdate
1716594 1716594 bytesread
   4668    4668 byteswritten
      0       0 bytesfromserver
   3769    3769 bytesfromdirs
1712825 1712825 bytesfromcache
      0       0 bytestocache
        Client                          Server
   #calls     Δ  ms/call    Δ      #calls     Δ  ms/call    Δ
      1       0   0.750               1       0   0.743         Tversion
      7       0   0.575               7       0   0.569         Tauth
      7       0   0.893               7       0   0.888         Tattach
    327       2   0.490   0.407     325       1   0.487   0.799 Twalk
    149       2   0.499   2.373     147       1   0.470   0.442 Topen
    766       2   0.275  14.317      18       0   3.644         Tread
     16       0   5.071              16       0   5.065         Twrite
    137       2   0.620   2.962     137       2   0.616   2.955 Tclunk
    169       0   0.468             169       0   0.455         Tstat
     11       0 ndirread
      7       0 ndelegateread
      0       0 ninsert
      0       0 ndelete
      5       0 nupdate
1716594       0 bytesread
   4668       0 byteswritten
      0       0 bytesfromserver
   3769       0 bytesfromdirs
1712825       0 bytesfromcache
      0       0 bytestocache
        Client                          Server
   #calls     Δ  ms/call    Δ      #calls     Δ  ms/call    Δ
      1       0   0.750               1       0   0.743         Tversion
      7       0   0.575               7       0   0.569         Tauth
      7       0   0.893               7       0   0.888         Tattach
    388      61   0.496   0.529     385      60   0.494   0.532 Twalk
    152       3   0.520   1.527     149       2   0.471   0.527 Topen
    773       7   0.311   4.267      20       2   3.330   0.507 Tread
     17       1   5.111   5.746      17       1   4.822   0.940 Twrite
    140       3   0.638   1.451     140       3   0.634   1.442 Tclunk
    227      58   0.481   0.517     227      58   0.468   0.507 Tstat
     13       2 ndirread
      7       0 ndelegateread
      0       0 ninsert
      0       0 ndelete
      5       0 nupdate
1736279   19685 bytesread
   4693      25 byteswritten
      0       0 bytesfromserver
  11431    7662 bytesfromdirs
1724848   12023 bytesfromcache
     25      25 bytestocache

Every time I read from the /cfsctl file after the data had been changed, the new data would be appended at the end, and this process continues arbitrarily. This seems paradoxical to me because the statbuf is a static array of 2048 bytes, and there are only 2 9p requests being fulfilled, one for a bit over 1000 bytes, and one for 0. Here are what the debugging prints look like with this version of the code (note that these offsets are fake, the actual c.thdr offset request size is huge:

rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 4811 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0
rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 3684 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0
rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 2557 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0

This data was collected in a test where I was trying to see just how far the extra buffering/appending would go - and so far I have not found a limit. At the moment, I am receiving over a megabyte of data from reads of /cfsctl:

cpu% cat /cfsctl |wc

24184 130469 1200544

All of this data is not being stored anywhere by the cfs process itself, its memory footprint is far too small:

cpu% ps -a |grep cfs
glenda          211    0:00   0:56      312K Pread    cfs

Here is some of the data from the raw 9p requests printed after the convM2S in rcvmsg:

rcvmsg: count is 1430 offset is 1194602
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 20 offset is 323088
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 303 offset is 1195729
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 20 offset is 323088
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 8192 offset is 1196032

I have some logic to turn on/off those prints depending on whether or not the /cfsctl file is being read, so I'm not sure if the "count is 20" messages are part of the cfsctl read transaction. Complete debug code that I'm running is at http://files.9gridchan.org/cfsdebug.c and the raw logs that im extracting these debug sample outputs from is at http://files.9gridchan.org/cfsdebug.log

So What/Why/How ?

What seems like it must be happening is that the kernel is caching responses to reads of the /cfsctl file and maintaining the offset between successive invocations of cat. Somewhere in the communication chain between cat, the mount device (devmnt.c in the kernel) and the actual cfs program, things are getting confused, and the kernel must be keeping a large buffer of data which it is replaying to each new cat.

The cause of the progressively larger offsets to cat is unclear to me. I don't understand how whatever cfs is doing wrong is causing cat/devmnt to behave in this way.

The Lightbulb Goes On

After writing up all the above, I was ready to consult the lead 9front dev, Cinap. As usual, he was able to diagnose and fix in the issue in about five minutes of total irc discussion. Ironically enough for me as a namespace fanatic, I was forgetting something about namespaces in the standard /lib/namespace file. The very first line:

mount -aC #s/boot /root $rootspec

The -C flag is the key here. As we know from man 1 mount:

-C	(Only in mount.) By default, file contents are
	always retrieved from the server.  With this
	option, the kernel may instead use a local cache
	to satisfy read(5) requests for files accessible
	through this mount point.

So, now things fall into place: we are seeing an interaction between a caching mechanism intended for use with fses serving static files, and the synthetic cfsctl statistics file. In addition to pointing out the cause, Cinap also had a fast and easy fix: increment the qid (unique 9p protocol file identifier) at the end of the genstats() routine:

ctlqid.vers++;

Along with a minor adjustment to when genstats() is invoked, that was all that was necessary to fix the issue with the bad behavior of cfsctl in combination with the -C flag to mount.

Lessons Learned

One of the hardest challenges in debugging is remembering everything you know - I certainly knew the -C flag existed in the sense that I had read the manpage for mount many times, and had seen it used in the mount of root in the standard namespace file. Despite this, when my debugging led me to conclude that the kernel was supplying previously cached data to a read request, the existence and relevance of the -C flag completely slipped my mind.

You might say it was a "can't see the forest for the trees" issue - I was bogged down in the details of how cfs was handling 9p messages, and I didn't manage to take a step back from those specifics and notice that the kernel was being specifically told to keep a read cache for the root filesystem which it was providing.