Recently I installed a brand new 2-node Hyper-V 2016 cluster with the latest generation of hardware for one of our customers. I finished the setup of the cluster on Friday and happily left for the weekend.
I had a day off on Monday and came back to the office on Tuesday. One of the tasks still on my list was setting up the Hyper-V virtual machines on the cluster. I logged into one of the nodes to continue where I had left off on Friday. As usual, Server Manager popped up, and I was ready to close it as soon as it had fully loaded. While it was loading, something caught my attention: Server Manager was reporting an enormous number of events…
Weird… it’s a brand-new cluster? Upon inspection, I noticed that all alerts were related to the cluster. Obviously, the next step was to open Failover Cluster Manager and see what was going on. Starting the cluster manager and connecting to the node took an unusually long time. Since I was already logged into one node via RDP, I tried to log in to the other node via RDP as well. The connection attempt hung for a very long time… so I killed it. I then logged in to the server’s out-of-band management, and there too I only got a black screen back. Once more, strange things were happening. How could the console not give me an image of the server? I was connecting straight to the hardware, not just establishing another remote session…
I tried once more to create an RDP session; this time I almost immediately got a black screen with an error on it. Since I was no longer able to connect to the server, I rebooted it.
The server came back up without any issue. I double-checked the out-of-band management for reported hardware problems but found nothing. The logs were clean: no hardware issues.
When the server was back up and running, I could log in via RDP again. I checked the event log, and one error caught my attention: Event ID 4005 – winlogon.exe crashed. That explained why I couldn’t connect to the server.
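If you want to check for the same symptom, Event ID 4005 is logged by the Microsoft-Windows-Winlogon provider to the Application log, so a query along these lines should surface it (a sketch; adjust the log name if your events land elsewhere):

```powershell
# List recent Winlogon crash events (Event ID 4005) on this node.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'Microsoft-Windows-Winlogon'
    Id           = 4005
} -MaxEvents 10 |
    Format-Table TimeCreated, Id, Message -AutoSize
```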
First thing to check: are all Windows updates installed? Check, up to date. My next thought was to go through the other relevant logs (System, cluster events, a generated cluster log, …) to see if there was anything else that could point me in the right direction.
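Generating the cluster log can be done from PowerShell; something like the following dumps a Cluster.log per node into a folder of your choosing (C:\Temp here is just an example path):

```powershell
# Generate Cluster.log for every node and drop the files in C:\Temp,
# with timestamps in local time rather than UTC.
Get-ClusterLog -Destination C:\Temp -UseLocalTime
```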
After a lot of research, I found out that the RHS.exe process of the cluster had crashed and taken winrm.exe and winlogon.exe with it, rendering the server useless until a reboot was performed.
Noticing that a reverse lookup (PTR) zone was missing in DNS, I started by creating it and adding the relevant records for the cluster. After about 28 hours, the same issue occurred, so that was not the root cause.
Let’s look at the RHS.exe process. RHS (the Resource Hosting Subsystem) hosts the resources running on the cluster. If your host has enough memory and CPU headroom, it is possible to put each cluster resource in its own RHS.exe process. This can be done by selecting the resource in Failover Cluster Manager, going to the Resources tab in the detail view of the selected resource, right-clicking and selecting Properties. Go to the ‘Advanced Policies’ tab and tick the ‘Run this resource in a separate resource monitor’ box:
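The same checkbox can be flipped from PowerShell via the SeparateMonitor property of a cluster resource. A sketch, where the resource name is a placeholder:

```powershell
# Run a resource in its own RHS.exe process (SeparateMonitor = 1).
# 'Virtual Machine VM01' is a placeholder; substitute your resource name.
$res = Get-ClusterResource -Name 'Virtual Machine VM01'
$res.SeparateMonitor = 1
# The change takes effect the next time the resource is restarted.

# Verify which resources run in a separate monitor:
Get-ClusterResource | Format-Table Name, ResourceType, SeparateMonitor
```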
I made the necessary changes to the resources on the cluster and verified I had multiple rhs.exe processes in task manager.
Now, all I had to do was wait until it crashed again. After about 30 hours, the problem popped up. This time I logged in remotely via PowerShell from the healthy host onto the host that was displaying the error. This allowed me to see the list of running processes, and to my surprise all my RHS.exe processes were still running… yet the cluster had crashed. This was a big hint towards the cause of the error.
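A remote check along these lines would show whether the RHS.exe processes are still alive on the other node (NODE2 is a placeholder for the affected host’s name; this assumes PowerShell remoting still responds):

```powershell
# From the healthy node, list the RHS processes on the affected node.
Invoke-Command -ComputerName NODE2 -ScriptBlock {
    Get-Process -Name rhs | Format-Table Id, StartTime, ProcessName
}
```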
As explained above, it is possible to split out all resources of a cluster into their own RHS.exe processes, but this is not possible for CSVs (Cluster Shared Volumes); the option is simply not available for them. Since all my RHS.exe processes were still running on the affected node, and the cluster still crashed, the problem had to be with the CSVs. As they were connected via iSCSI, it had to be a problem with one of the iSCSI connections.
I rebooted the affected host, and when it was back up, I compared the iSCSI connections on both hosts. Sure enough, there was a subtle difference between the two: on the Favorite Targets tab, the affected host had more targets for one CSV than the host that was working fine. After completely removing the differing target and its related favorite targets from the iSCSI initiator and reconnecting the target, the iSCSI connections on both hosts matched up exactly.
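The comparison and cleanup can also be done with the iSCSI cmdlets instead of the initiator GUI. This is a sketch: the node names and the IQN are placeholders, and the session should not be in use (here the host had just been rebooted) before you touch it:

```powershell
# Compare connected targets on both nodes (NODE1/NODE2 are placeholders).
Invoke-Command -ComputerName NODE1, NODE2 -ScriptBlock {
    Get-IscsiTarget | Select-Object NodeAddress, IsConnected
}

# On the affected node: drop the sessions for the suspect target from
# the favorites list, disconnect it, then reconnect it once as a favorite.
# 'iqn.example:csv-target' is a placeholder IQN.
Get-IscsiTarget -NodeAddress 'iqn.example:csv-target' |
    Get-IscsiSession | Unregister-IscsiSession
Disconnect-IscsiTarget -NodeAddress 'iqn.example:csv-target' -Confirm:$false
Connect-IscsiTarget -NodeAddress 'iqn.example:csv-target' -IsPersistent $true
```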
I performed a rescan of the disks in Disk Management and tested failing the disks over to make sure they were still functional.
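In PowerShell, the rescan and a failover test could look roughly like this (Move-ClusterSharedVolume moves CSV ownership between nodes; NODE2 is a placeholder node name):

```powershell
# Rescan storage on this node (equivalent to a rescan in Disk Management).
Update-HostStorageCache

# Test failover: move each CSV to the other node.
Get-ClusterSharedVolume |
    ForEach-Object { Move-ClusterSharedVolume -Name $_.Name -Node 'NODE2' }
```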
To check whether this solved the issue, I logged out of the host and waited. I logged in to the host every day for 10 days without seeing the issue again.
This is what solved the issue for me. Of course, I researched this problem on the internet, but most of the results I found related to 2008 R2, 2012 or 2012 R2, which had solutions posted for them; those solutions were not applicable to a Windows 2016 Hyper-V cluster.