www.jlion.com

Monday, December 18, 2006

For a couple of days I've been working on a rather strange problem.

I have an application in which a Windows service calls a web service that in turn calls several other web services. Since the application is still under development, most of its parts currently reside on a single Windows Server 2003 machine. The Windows service calls the first web service every 5 seconds, and that service in turn calls the other two web services. Typically this works very well: I've been running it 24x7 for about two weeks and it has crashed only twice.
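For readers who want to picture the polling arrangement, here is a minimal sketch in Python (not the language of the actual application). The function names, the failure threshold and the URL in the usage note are all my own placeholders, not anything from the real service. It calls a URL every 5 seconds and gives up loudly after a few consecutive failures, so a stall is noticed before someone happens to walk in.

```python
import time
import urllib.request


def call_service(url, timeout=10):
    """Return True if the (hypothetical) service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError is a subclass of OSError, so this covers refused
        # connections, name-resolution failures and most timeouts.
        return False


def poll(url, interval=5.0, max_failures=3, call=call_service,
         sleep=time.sleep, max_cycles=None):
    """Call `url` every `interval` seconds.

    Stops after `max_failures` consecutive failed calls (or after
    `max_cycles` calls, if given) and returns the number of calls made.
    """
    failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        if call(url):
            failures = 0  # any success resets the streak
        else:
            failures += 1
            print(f"call {cycles} to {url} failed ({failures} in a row)")
            if failures >= max_failures:
                break
        sleep(interval)
    return cycles
```

Something like `poll("http://localhost/FirstService.asmx")` would start the loop; that address is made up for illustration. The `call` and `sleep` parameters exist only so the loop can be exercised without a network or a real delay.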

The first time it crashed, it was a weekday morning, and I saw fairly quickly that the application had stopped working. After ruling out the usual suspects (database problems, problems with the IVR hardware or with Speech Server), I found this interesting symptom: when logged in locally to the 2003 server, I was unable to access any of the web services using Internet Explorer. From my dev PC the web services came up fine, but on the server itself, when I tried to view any web page hosted on that server, I got a "server not found" error. This happened even when I used the server's IP address instead of its name. At last, out of annoyance and desperation, I rebooted the server.
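The name-versus-IP comparison described above can be scripted. The following Python sketch is only an illustration: the host names, the address and the path in the usage note are placeholders, and `reachable` and `probe` are names I made up. It requests the same page through several spellings of the host and reports which ones get any answer at all. Run once on the server and once on another machine, the two result sets can be compared side by side.

```python
import urllib.error
import urllib.request


def reachable(url, timeout=5):
    """Return True if any HTTP server answers at `url`."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status (401, 500, ...),
        # so it was "found" even though the request did not succeed.
        return True
    except OSError:
        # Connection refused, name not resolved, timeout and similar.
        return False


def probe(path, hosts, fetch=reachable):
    """Request `path` via each host spelling; map host -> reachable?"""
    return {host: fetch(f"http://{host}{path}") for host in hosts}
```

For example, `probe("/FirstService.asmx", ["localhost", "127.0.0.1", "devserver", "192.168.0.10"])` would return a dictionary of True/False values, one per host spelling. None of those names come from the real setup.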

Voila! After the reboot, I could again locally access web pages on my server. I didn't think much of this at the time. I put it down to weird behavior or perhaps some setting that I had inadvertently changed. I didn't want to spend too much energy researching a problem that might never happen again.

Then this past weekend the problem returned. On Monday morning at 12:00 AM the server stopped making its test calls. Of course I discovered this only when I came in, and I quickly found that the symptoms of the previous problem had returned: I could not access any page on my development server when logged in locally. Rather than reboot right away, I looked through the log files and at the IIS and permissions settings.

I could see that, starting at about midnight when my application stopped and continuing for perhaps an hour, there was a series of a hundred or so failure audits in the security event log where Network Service had tried and failed to access the WinHttpAutoProxySvc object. By 2:00 AM the failure audits had stopped, but the web services had not started working again. I rebooted, curious to see if this would again fix the problem and -- voila! -- it did. I could access the web services while logged in locally to the machine.

This just wouldn't do. I couldn't have my application spontaneously crashing once a week or so and staying down until someone rebooted the server. Time to call in the heavy guns. So I turned to Harold, our network administrator, and explained what I was seeing. "Hmmm..." he said, "this sounds like a problem with worker threads." He sat down at the server, pulled up IIS administrator and displayed the properties dialog for the 2.0 application pool. Sure enough, the "Maximum number of worker processes" setting on the Performance tab was set to 1. Harold increased this to 10 and said, with a confident air, "Let me know if you have any more problems -- before you reboot!"