WCF Reliable Session and keep-alives

Monday, January 26th, 2009

We recently used WCF services in a monitoring application that showed the status of business-critical systems, such as water and fire detection in the datacenter and the generators powering the production lines.

Instead of polling, we used an event-driven approach: clients subscribed to certain events and the server notified them when those events occurred, relieving the network of excessive and possibly unneeded chatter. Because the application was business-critical, we had to make sure that communication with the server was never silently halted or interrupted. A client that simply waits for events has no way of knowing on its own that the connection has dropped. So if something happened to the connection, the client had to notify the user, instead of showing green lights and giving the impression that everything was a-ok while, for instance, all hell broke loose in the datacenter (destroying the PLC along the way).

While working through the WCF documentation, we stumbled upon the Reliable Session, which can be specified in the binding of the connection. According to the documentation, a Reliable Session sends a keep-alive message after half of the Inactivity Timeout has elapsed. Unfortunately, the actual behaviour differed from what we expected: the connection went to the Faulted state after 10 minutes, even though we had configured an Inactivity Timeout of 10 minutes, which should have caused a keep-alive message to be sent after 5 minutes. Eventually we came across a blog post by Paulo Reichert that clarified this glitch. Apparently the keep-alives were supposed to keep the session open, but a last-minute change to the Receive Timeout behaviour overrode the keep-alive behaviour. *cough* Unit Testing *cough*
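To make the timing concrete, here is a deliberately simplified Python model of an idle session with two timers: an inactivity timer that every message resets, and a receive timer. It assumes the last application message arrived at minute 0 and, based on our reading of the glitch, that keep-alive messages do not reset the receive timer. The function `fault_time` and its parameters are hypothetical; this is an illustration of the timing only, not how WCF is implemented internally. Times are in minutes.

```python
from typing import Optional


def fault_time(inactivity_timeout: float, receive_timeout: float,
               keepalives_reset_receive: bool, horizon: float = 60.0) -> Optional[float]:
    """Return the minute at which an idle session faults, or None if it survives past `horizon`.

    Hypothetical model: no application messages after t=0, keep-alives every
    inactivity_timeout / 2 minutes.
    """
    interval = inactivity_timeout / 2  # documented keep-alive interval: half the inactivity timeout
    last_any = 0.0       # last message of any kind; resets the inactivity timer
    last_receive = 0.0   # last message that resets the receive timer
    t = 0.0
    while True:
        next_keepalive = t + interval
        # The session faults at whichever timer deadline comes first.
        deadline = min(last_any + inactivity_timeout, last_receive + receive_timeout)
        if deadline <= next_keepalive:
            return deadline if deadline <= horizon else None
        if next_keepalive > horizon:
            return None
        t = next_keepalive
        last_any = t  # every keep-alive resets the inactivity timer
        if keepalives_reset_receive:
            last_receive = t  # the documented behaviour: keep-alives keep the session open
```

With both timeouts at 10 minutes and keep-alives that do not reset the receive timer, the model faults at minute 10, which matches what we observed. Letting keep-alives reset the receive timer, or setting the receive timeout to infinity, keeps the modelled session alive.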

In the end we implemented our own keep-alive system in the form of a watchdog. A register in a PLC changes every x seconds, and the client is notified of each change through an event. If the client does not detect a change of this register for y seconds, we notify the user. This verifies both the connectivity between client and server (including feedback to the user) and the proper working of the PLC. Because these watchdog notifications travel over the connection like any other event, they act as keep-alive messages: each one resets the Inactivity Timeout, preventing it from reaching the specified value and causing the connection to be dropped.
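The client-side half of the watchdog can be sketched in Python as follows. This is a minimal sketch, assuming the register value arrives through an event handler and that the client runs its own periodic timer; the class and method names (`Watchdog`, `on_register_changed`, `check`) are hypothetical, and our actual client was a WCF (.NET) application.

```python
import time
from typing import Callable, Optional


class Watchdog:
    """Hypothetical client-side watchdog: alarms when a watched PLC register stops changing."""

    def __init__(self, timeout_s: float, on_alarm: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s  # the "y seconds" from the text
        self.on_alarm = on_alarm    # e.g. switch the status lights to an "unknown" state
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_value: Optional[int] = None
        self.last_change = clock()
        self.alarmed = False

    def on_register_changed(self, value: int) -> None:
        """Event handler called when the server pushes a register value."""
        if value != self.last_value:  # only an actual change counts as a heartbeat
            self.last_value = value
            self.last_change = self.clock()
            self.alarmed = False      # connection and PLC are alive again

    def check(self) -> bool:
        """Call periodically from a local timer; returns True while the register keeps changing."""
        if self.clock() - self.last_change > self.timeout_s:
            if not self.alarmed:      # notify once per outage, not on every check
                self.alarmed = True
                self.on_alarm()
            return False
        return True
```

The important design choice is that `check` is driven by a timer on the client itself, not by incoming messages: a silently dropped connection produces no events at all, so only a local clock can notice the silence.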

Setting the Receive Timeout to infinite, as Microsoft suggests, is a workaround that does not solve the underlying problem. We sincerely hope Microsoft will fix this bug in a future release!