Tuesday, December 9, 2008

ESX Madness

Scenario: One application server (Windows Server 2003, IIS) in a VM on ESX 3.5 Update 2. Works fine, connects to a DB on a physical machine over a gigabit link. Adding a second application server causes connections to the common database to fail under load testing, from both machines.

Tried: Looked at IRQ usage in /proc/vmware/interrupts. Interrupts appear to be distributed across all cores as expected. Earlier versions of ESX have issues with IRQ usage... see http://www.tuxyturvy.com/blog/index.php?/archives/37-Troubleshooting-VMware-ESX-network-performance.html

Checking and optimizing code... only made the problem less frequent.

Problem isolated: The problem turned out to be linked to the NIC on the VM. The flexible adapter was failing under load for whatever reason. Sadly, this is the only adapter available for Windows Server 2003 Standard 32-bit in ESX 3.5.

The Fix:

1. Shut down one of the VMs.
2. Go into the virtual machine's settings and change the OS type to Server 2003 Enterprise 64-bit. Save and exit the settings window.
3. Re-enter settings and add a network card to the VM. You will see that the e1000 is now available.
4. After adding the e1000, change the OS type back to Server 2003 Standard 32-bit and start the VM.
5. In Windows, go to Device Manager and select Properties for the new network adapter. Shut off all of the offloading (TCP checksumming, etc.).
6. Move your network settings to the new adapter, disable the old one in Windows, and disconnect it in the ESX VM settings.

In summation: Yes, it's a little like duct tape. But it worked for me. And if you are in a pinch and cannot afford the downtime to thoroughly examine a flailing production machine, this may work for you too. If it doesn't work for you, I am not responsible for data loss, downtime, etc... blah blah...
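For anyone who wants to repeat the IRQ check mentioned under "Tried" without eyeballing the numbers, here is a rough Python sketch that sums interrupt counts per CPU and reports how lopsided they are. It assumes a table laid out like Linux's /proc/interrupts (a header row of CPU names, then one row per vector with a count per CPU); the actual layout of /proc/vmware/interrupts on your ESX build may differ, and the sample data and the function names per_cpu_totals and imbalance are made up for illustration.

```python
def per_cpu_totals(text):
    """Sum interrupt counts per CPU column.

    Assumes a /proc/interrupts-style table: first non-empty line is a
    header (CPU0 CPU1 ...), each following line is 'vector: n0 n1 ...'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # number of CPU columns in the header
    totals = [0] * ncpu
    for ln in lines[1:]:
        _label, _sep, rest = ln.partition(":")
        fields = rest.split()[:ncpu]
        # Skip rows that do not carry one numeric count per CPU.
        if len(fields) < ncpu or not all(f.isdigit() for f in fields):
            continue
        for i, f in enumerate(fields):
            totals[i] += int(f)
    return totals


def imbalance(totals):
    """Ratio of the busiest CPU's count to the mean; 1.0 means perfectly even."""
    mean = sum(totals) / len(totals)
    return max(totals) / mean if mean else 0.0


# Made-up sample data, not real ESX output.
SAMPLE = """\
        CPU0    CPU1    CPU2    CPU3
0x21:   1000    1100     900    1000
0x29:    500     400     600     500
0x31:   2000    2000    2000    2000
"""

if __name__ == "__main__":
    totals = per_cpu_totals(SAMPLE)
    print(totals, imbalance(totals))
```

A ratio close to 1.0 suggests interrupts are spread evenly; a ratio near the number of cores suggests one core is handling nearly everything. To try it on a host, paste the contents of the interrupts file in place of SAMPLE and adjust the parsing if the columns do not line up.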
