Sometimes it takes me too long to diagnose problems in a third-party library because it’s hard to figure out how to put printfs in the code to see what is actually going on. In this case it was netty 3.2.7.Final that was causing me angst.
But I had sorted out how to patch a library like netty before and I had the Makefile targets lying around to get it done. Once I spent a minute with that I had no trouble tracking down a nasty little problem with netty.
Yeah - so the problem was that I was making an asynchronous connection to a server and all I was getting back for my troubles was a ClosedChannelException! No explanation at all, but I did have a stack trace, which I eventually decided to pay attention to. Here’s one line from the rather ugly stack trace that provided the first clue:
OK - I am getting a timeout while connecting. Cool. I would have preferred a timeout exception as a response, but I can work with this! The problem was that the timeout was happening within 50 milliseconds of trying to connect, not after the 10,000 milliseconds I had actually specified, so what gives? I couldn’t tell whether the connect had succeeded, whether it had even been attempted, really anything at all! So I added a couple of printlns to the code and the answer became obvious rather quickly.
The problem was that when a connection succeeds, the netty code cancels the selector key associated with that connection. Makes sense: we no longer need to select on connect events once we’re connected. Unfortunately, the next thing the Boss main loop does is check for connection timeouts. It doesn’t do this every time through the loop, only once every 0.5 seconds, which is why the problem was intermittent. Anyway, this code loops through all the selector keys and closes every channel whose selector key is canceled. WTF!? Note that this means the problem only occurs when a new connection completes in the same loop iteration that also processes connection timeouts.
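The reason a timeout pass can see a canceled key at all is standard java.nio behavior: a canceled key stays in the selector's key set, with isValid() returning false, until the next selection operation removes it. Here is a minimal sketch of that behavior using only java.nio. The class and method names are made up for illustration, and none of this is netty code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, not part of netty.
public class CancelledKeyDemo {

    /** Returns {listedAfterCancel, validAfterCancel, listedAfterSelect}. */
    static boolean[] observe() {
        try (Selector selector = Selector.open();
             SocketChannel channel = SocketChannel.open()) {
            // No connection is attempted; we only need a registered key.
            channel.configureBlocking(false);
            SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT);

            // What happens to the key once a connect has completed.
            key.cancel();

            // Before the next select, the key is still in keys() but invalid.
            boolean listedAfterCancel = selector.keys().contains(key);
            boolean validAfterCancel = key.isValid();

            // A selection operation flushes canceled keys out of the key set.
            selector.selectNow();
            boolean listedAfterSelect = selector.keys().contains(key);

            return new boolean[] {listedAfterCancel, validAfterCancel, listedAfterSelect};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = observe();
        System.out.println("listed after cancel: " + r[0]);
        System.out.println("valid after cancel:  " + r[1]);
        System.out.println("listed after select: " + r[2]);
    }
}
```

So any code that walks the key set between a cancel and the next select has to expect invalid keys, and has to decide what an invalid key means.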
I think the call to close was a cut+paste error, so I fixed it by deleting the call to close. Now processConnectTimeout just ignores selector keys that are canceled. They will be dealt with later anyway.
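To make the difference concrete, here is a small stand-alone model of a timeout pass before and after that kind of change. It is a sketch, not the netty source: the Pending class, both method names, and the sample data are invented for illustration, and a plain boolean stands in for the selector key's validity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a connect-timeout pass; none of these names come from netty.
public class ConnectTimeoutSketch {

    /** Stand-in for a pending connect and the state of its selector key. */
    static final class Pending {
        final String name;
        final boolean keyValid;    // false once the key has been canceled
        final long deadlineNanos;  // when the connect attempt should time out

        Pending(String name, boolean keyValid, long deadlineNanos) {
            this.name = name;
            this.keyValid = keyValid;
            this.deadlineNanos = deadlineNanos;
        }
    }

    /** Buggy variant: treats a canceled key as a reason to close the channel. */
    static List<String> buggyPass(List<Pending> pending, long nowNanos) {
        List<String> closed = new ArrayList<>();
        for (Pending p : pending) {
            if (!p.keyValid) {
                closed.add(p.name); // closes a connection that just succeeded
                continue;
            }
            if (nowNanos >= p.deadlineNanos) {
                closed.add(p.name); // genuine timeout
            }
        }
        return closed;
    }

    /** Fixed variant: canceled keys are skipped, only real timeouts close. */
    static List<String> fixedPass(List<Pending> pending, long nowNanos) {
        List<String> closed = new ArrayList<>();
        for (Pending p : pending) {
            if (!p.keyValid) {
                continue; // leave it alone; it is no longer waiting to connect
            }
            if (nowNanos >= p.deadlineNanos) {
                closed.add(p.name);
            }
        }
        return closed;
    }

    /** One freshly connected channel, one overdue connect, one still waiting. */
    static List<Pending> sample() {
        long ms = 1_000_000L;
        return Arrays.asList(
            new Pending("connected", false, 10_000 * ms),
            new Pending("stalled", true, 10 * ms),
            new Pending("waiting", true, 10_000 * ms));
    }

    public static void main(String[] args) {
        long now = 50 * 1_000_000L; // 50 ms after the connects started
        System.out.println("buggy closes: " + buggyPass(sample(), now));
        System.out.println("fixed closes: " + fixedPass(sample(), now));
    }
}
```

In this model the buggy pass closes the "connected" entry 50 ms in, long before its 10,000 ms deadline, while the fixed pass closes only the entry that really timed out.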
So here’s the one line change (commented out close call):
Hope somebody finds this helpful.