Channel: VMware Communities : All Content - All Communities

Windows 2008R2 Domain Controller sometimes goes 100% CPU during a snapshot or vmotion


I have an environment with roughly 17 ESX hosts, 1000 VMs, 100 SQL servers, 500 Citrix servers and 20 or so domain controllers covering just as many different domains.

 

For some years now, we have been getting random issues with domain controllers going to 100% CPU in the middle of the night. A few boxes seem more likely to hit 100% than others, and some never do it at all, but it's completely unpredictable.

 

I have identified some 'triggers' over time:

1. Initially, we noticed that during late-night reboots of several hundred Citrix servers, DRS would kick in, do the snap > move > unsnap process, and cause a DC to go to max CPU. We can't log in anymore and are forced to hard boot it to get it working again.

 

2. I have seen occasional cases where there was no DRS vMotion or snapshot and a DC still went to 100% CPU, but I can't rule out the snap/unsnap processes of backup and DRS either.

 

3. Last year we added a backup-from-SAN system to back up our virtual images directly from the SAN overnight. Since then, I have started to see the snapping/unsnapping process trigger this 100% CPU problem as well. NONE, as in zero, of our other VMs go to 100% CPU after a snap, but once or twice a week a group of DCs has random instances of this 100% CPU kill.

 

I definitely see some correlation between the 100% CPU incidents and the snap/unsnap activity that takes place during a vMotion, DRS move or snapshot. I see no event log entries in Windows giving any clues. In fact, the event log stops logging at the moment the box goes to 100% CPU. We could just isolate these boxes and never DRS, vMotion or snap them, but that's not completely practical either. I want to understand what is happening better so I can mitigate or stop this from occurring.
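To put that correlation on firmer ground, one rough approach is to line up the time each DC stopped logging against the snapshot and vMotion task times for that VM. The sketch below assumes both lists have already been exported by hand. The function name, the 30-minute window, the event labels and the sample timestamps are all hypothetical, not taken from this environment.

```python
from datetime import datetime, timedelta


def incidents_near_events(incidents, events, window=timedelta(minutes=30)):
    """Map each incident time to the events that occurred within `window` before it.

    incidents: list of datetimes (e.g. last event log entry before the hang)
    events:    list of (datetime, label) tuples (e.g. snapshot or vMotion tasks)
    """
    matches = {}
    for incident in incidents:
        matches[incident] = [
            (when, label)
            for when, label in events
            # Keep only events at or before the incident, no older than the window.
            if timedelta(0) <= incident - when <= window
        ]
    return matches


# Hypothetical sample data, for illustration only.
events = [
    (datetime(2024, 1, 10, 1, 50), "snapshot-remove"),
    (datetime(2024, 1, 10, 3, 0), "vmotion"),
    (datetime(2024, 1, 11, 2, 0), "snapshot-create"),
]
incidents = [
    datetime(2024, 1, 10, 2, 5),  # 15 minutes after the snapshot removal
    datetime(2024, 1, 12, 4, 0),  # nothing logged nearby
]

result = incidents_near_events(incidents, events)
for incident, nearby in result.items():
    print(incident, [label for _, label in nearby] or "no nearby event")
```

Incidents that come back with an empty list are the ones worth a closer look, since they would match trigger 2 above, where no vMotion or snapshot was involved.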

 

We followed the best practices for virtualizing domain controllers when we implemented this, and I am now re-reading them to see if anything was missed.

 

I wanted to post here to see if anyone else has had to battle this problem before and has any fixes or thoughts.

