Last couple of days I have been setting up and tuning Windows Server 2003 clusters. When performing the various fail-over tests as part of a test plan, one of the cluster nodes suddenly experienced a complete system hang. This happened when I tried to start the cluster server service on the node I was testing on. My terminal session, appeared to be dead, but the TCP/IP session remained open and the RDP client did not time-out as result. I could ping the system, but everything else appeared to be dead. So opening a MMC snap-in to see whether the cluster server service did actually start, didn't do much either, except giving me a empty right pane. Since the servers where physically located at the other side of the country (The Netherlands), and the remote supervisor adapter was not yet built in, due to back-order of the part, I was facing the possibility to need to drive to the server myself. Not something one wants to do in the evening after a long day. So I decided to wait and decide to do with it in the morning. A 15 minutes later or so, I wanted to check the logs on the remaining cluster nodes to see if there's anything out of the ordinary in there, when I noticed the terminal session of the failing node, was back again. I started checking logs and noticed the system had a time gap of 10 minutes in every log I could think of (event log, cluster log etc. etc.). I wanted to see whether I could reproduce the behavior on the other node as well. So I repeated the same tasks, stopping the cluster server service and start it again. Exactly the same happened on the other node as well. I decided to wait and see what happens. After 10 minutes, the system came back again, and responded like nothing happened. This system also had a gap of 10 minutes in its log files. My suspicion rose, that the Anti-virus program (Sophos) was interfering with cluster operations. When strange problems on server occurs, it's usually: lack of disk space, security right issue, driver issue or anti virus software. In my case, anti virus software. I confirmed this, by stopping the Sophos services and do the testing again. This time everything went as it supposed to go. Within minutes, I found a workaround. Simply exclude the %SystemRoot%\cluster folder from on-access scanning. This is a best practise anyway, but this didn't satisfy my need to know...
Why did Sophos hang the system for 10 minutes?
Before I explain what happened, you need to understand what kind of drivers a common on Windows platforms. Basically you can categorize drivers into two groups. One groups of drivers is called device drivers, which communicate with hardware and the second group of drivers called filter drivers, which usually acts as a layer/interface between the OS components and other drivers. All drivers have one thing in common. They're called via an interrupt. There're two types of interrupts.
- Hardware initiated interrupts, which are handled by device drivers.
- Software initiated interrupts.
Interrupts have assigned interrupt levels with it. In the Intel world, the higher the interrupt level, the higher the priority. Below an overview from MSDN with interrupt priority levels:
IRQL IRQL value Description
x86 IA64 AMD64
PASSIVE_LEVEL 0 0 0 User threads and most kernel-mode operations
APC_LEVEL 1 1 1 Asynchronous procedure calls and page faults
DISPATCH_LEVEL 2 2 2 Thread scheduler and deferred procedure calls (DPCs)
CMC_LEVEL N/A 3 N/A Correctable machine-check level (IA64 platforms only)
Device interrupt levels (DIRQL) 3-26 4-11 3-11 Device interrupts
PC_LEVEL N/A 12 N/A Performance counter (IA64 platforms only)
PROFILE_LEVEL 27 15 15 Profiling timer for releases earlier than Windows 2000
SYNCH_LEVEL 27 13 13 Synchronization of code and instruction streams across processors
CLOCK_LEVEL N/A 13 13 Clock timer
CLOCK2_LEVEL 28 N/A N/A Clock timer for x86 hardware
IPI_LEVEL 29 14 14 Interprocessor interrupt for enforcing cache consistency
POWER_LEVEL 30 15 14 Power failure
HIGH_LEVEL 31 15 15 Machine checks and catastrophic errors; profiling timer for Windows XP and later
I highlighted, the three most important ones. Most threads run at IRQL (Interrupt Request Level) = PASSIVE_LEVEL, the lowest available interrupt level. The Windows thread scheduler aka. dispatcher runs are IRQL = DISPATCH_LEVEL. Device interrupts run at IRQL >= 3. IRQL = DISPATCH_LEVEL, is the higest software interrupt available.
Whenever a ISR (Interrupt Service Routine) is entered due to an external interrupt pending. For example, a network packet is received by the NIC (Network Interface Card), the payload is received through its driver TCP/IP receive buffers, a Deferred Procedure Call (DPC) is made and is queued for execution. This is done, to minimize driver code execution time and to speed up things. The DPC thread is later executed by the dispatcher as soon as the ISR exits and no other high priority threads are scheduled.
The dispatcher is the core thread scheduler of the Windows NT family. It's responsible for scheduling threads to run on a processor. This can be driver threads, systems threads (DPC's for instance), user threads. It runs threads in order by priority. This is important to know, since it will point us into the right direction.
So the following order of executing is maintained by the dispatcher and the system:
- Hardware interrupts (IRQL >= 3).
- Software interrupts at DISPATCH_LEVEL (IRQL = 2).
- Software interrupts at APC_LEVEL (IRQL = 1).
- Software interrupts at PASSIVE_LEVEL (IRQL = 0).
Actually, there is an "IRQL = 0.5" level, better known as critical region. This one is available so a portion of the code within a thread can not be interrupted by code that runs at a higher IRQL. Except of course, for hardware interrupts. The Associated API (Application Programmer Interface) with critical regions is: KeEnterCriticalRegion().
Now that I explained the basics, it's time to analyze what happened.
When the Cluster Server service tries to start, it needs to establish communication with the cluster its about to join and it needs to determine whether the cluster is in a valid state. To do that, it needs to execute the following steps:
- Open its local copy of the cluster configuration database to read the cluster configuration parameters like: cluster name, cluster IP address, etc. etc.
- Join the cluster using the informatin obtained from the cluster configuration database.
However, when the Cluster Server service was trying to op its configuration database, Sophos intervened. It did by using its filter driver to intercept the I/O API calls made by the system. Probably API calls like: CreateFile() with exclusive access (dwShareMode=0). To know this for sure, a debugger would have been required, but time and resources prohibited me from investigating. So this is the guessing part.
The following sequence of events probably happened:
- IRQL = 0 : Request by the Cluster Server service account to open its configuration database.
- IRQL = 0 : System thread that opens the file exclusively.
- IRQL = 2 : Sophos filter driver interrupts and open the file itself to scan its content on virusses.
- IRQL = 2 : Sophos probably tries to aquire a spin lock, but can't since the Cluster Server service owns one on the file. Which is fair, since the Cluster Server service probably doesn't want any other program being able to manipulate the configuration database while its being used.
- IRQL > 2 : Network packet is received by NIC and put into the DPC queue by the driver. My RDP session to the system.
- IRQL = 2 : System resumes normal operation, the dispatcher runs and determines the DPC needs to be processed and probably some other system threads too.
- IRQL = 0 : User threads are executed if no other higher priority threads waiting.
At "4" the Sophos filder driver thread loops, until a) the lock is aquired, b) a timer on thread level expires. Since the network packets for RDP session to the system initiates hardware interrupts at "5", the ISR for the NIC is entered and the payload put into the DPC queue. The system then returns to "4" and since the DPC queue runs at IRQL = DISPATCH_LEVEL, the same level as the Sophos filter driver runs, which happens to loop for appromimately 10 minutes, no DPC's are executed and the system appears to be hanging. After these 10 minutes, the Sophos filter driver thread lowers its priority to IRQL = PASSIVE_LEVEL. The dispatches then determines it's time to process the DPC queue at "6" and schedules its thread. After that, any other thread user/system that needs to be run are executed and the system comes back to normal operation at "7" as if nothing happened.
Of course more is involved, since it was a multi processor system, but that would make this explanation unnecessary complicated.
My two cents...