The Art of Troubleshooting VM Problems

Please add your personal insight to this article.  It is a work in progress.

This article was started because of the number of questions in the Hyper-V forum that begin with "my application is not performing well in a VM, how can I fix it?"
The intent of this article is to approach the puzzle from just that perspective: something is not right, so how do I figure out where the problem is so I can fix it?
It is closer to the art of debugging something that is running in a VM on a hypervisor.

Background

Troubleshooting problems within a virtual environment can be considerably more complex than troubleshooting problems with a physical server.  This is mainly due to the way VMs behave in comparison to physical servers. 

Questions to Ask Yourself

In any virtualization system there are bottlenecks.

The most common bottleneck is disk I/O as this is frequently the most limiting resource - the disks are only so fast, there are only so many read / write heads, etc.

This can be compounded by the workloads in your VMs.  If you have a VDI deployment and you boot a bunch of VMs at the same time, they all kill each other because booting and shutting down are extremely high disk I/O activities.
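One way to soften a boot storm is to start the VMs in small waves rather than all at once.  The sketch below only computes a start schedule; it does not start anything.  The function name, the VM names, the batch size, and the delay are all hypothetical values chosen for illustration, and the right numbers depend on your storage.

```python
def staggered_start_offsets(vm_names, batch_size, delay_seconds):
    """Return {vm_name: seconds_to_wait_before_starting}.

    VMs are started in waves of `batch_size`, with `delay_seconds`
    between waves, so their boot-time disk I/O does not all land at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        name: (index // batch_size) * delay_seconds
        for index, name in enumerate(vm_names)
    }


# Hypothetical VDI pool: two VMs per wave, 90 seconds between waves.
schedule = staggered_start_offsets(
    ["vdi-01", "vdi-02", "vdi-03", "vdi-04", "vdi-05"],
    batch_size=2,
    delay_seconds=90,
)
for name, offset in schedule.items():
    print(f"{name}: start after {offset} s")
```

A schedule like this could feed whatever tool you already use to start VMs; the point is only that the starts are spread out.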

Beyond disk I/O there are possible networking issues, application and threading issues, and issues simply introduced by VM configuration (too many virtual processors is the most common).  These are the most common killers of VMs that I have seen in 6+ years.

The way to truly understand your bottleneck is to run only one VM.  How does it perform?  What is it doing?  If you open Task Manager in the VM, is there a particular process that is taking a lot of the processor time?  (Don't use Perf Mon for this.)

Does the application page to disk (not just the OS in the VM)?  Are all the VMs doing the same thing at the same time?  Is the behavior only evident when doing something with the VM over the network?

Configuration of the VM

Believe it or not there are configurations that have a negative impact on performance.  

The most common configuration that has a negative impact is a client operating system (such as Windows XP or Windows 7) that is given too many virtual processors or too much RAM.
A common observation is that client operating systems run better in a VM with one virtual processor and between 1 and 4 GB of RAM.

Processor

One impact is related to CPU, threading, handles, and other things related to processor.

The processor comes into play with certain types of applications.  It is not unusual to see an application in a VM that appears to be CPU starved, consuming all of the available CPU and simply crawling.  This is typically an application that is single threaded and was only tested on bare metal.

When an application runs on bare metal the application has almost all of a processor available to it - all the handles, all the threads.  It can execute as fast as the CPU can process.  Within a VM this is different.  That processor is divided up and shared among all the VMs.  So the amount of time on the CPU is divided into slices.  This is explained in detail here: http://social.technet.microsoft.com/wiki/contents/articles/hyper-v-concepts-vcpu.aspx

In the end, not all applications virtualize well.  And this is a major reason why.  The VM can be given a higher priority on the CPU and this helps some, but in the end the best fix is to re-write the application, or keep this particular application on a bare metal server.

The application(s)

Many folks overlook the application itself as the root of why it does not perform well within a VM.
A portion of this is the nature of hypervisors and how they share the CPU among many VMs.  There is the concept of CPU time slicing.  I describe that here in the CPU section: http://itproctology.blogspot.com/2009/12/hypervisor-virtualization-basics.html

That said, this will greatly impact any single threaded application.  Most commonly these are processing workstations that work through video, calculations, simulations, etc.  When running on bare metal, they take all they can get and they get no CPU interruption.  Within a VM they also take all they can get, but they are interrupted, because they are constantly paused and restarted as the CPU is shared among everyone.  (This happens even if there is only one VM.)
A faster CPU means this pause and start happens faster.  So the impact on the VM is less, but it still happens.
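To put rough numbers on the time slicing, here is a deliberately simplified model.  It assumes every virtual processor is busy and that the hypervisor shares the logical processors evenly; it ignores scheduler overhead, weights, and reserves, so treat the result as a best case rather than a prediction.  The function names are made up for this sketch.

```python
def cpu_slowdown_factor(logical_processors, busy_vcpus):
    """Best-case slowdown for one busy vCPU under even sharing.

    1.0 means the vCPU could, in theory, run uninterrupted.
    3.0 means it gets roughly one third of a logical processor.
    """
    if logical_processors < 1 or busy_vcpus < 1:
        raise ValueError("both counts must be at least 1")
    return max(1.0, busy_vcpus / logical_processors)


def estimated_wall_seconds(cpu_seconds, logical_processors, busy_vcpus):
    """Lower bound on wall-clock time for a single threaded job."""
    return cpu_seconds * cpu_slowdown_factor(logical_processors, busy_vcpus)


# A job that needs 60 s of CPU on bare metal, on a host with
# 4 logical processors and 12 busy virtual processors in total.
print(estimated_wall_seconds(60, logical_processors=4, busy_vcpus=12))
```

Under these assumptions the 60 second job needs at least 180 seconds in the VM.  The real figure will be worse, because even a single VM is paused and restarted by the hypervisor.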

Physical storage disk I/O

This begins the discussion of two common issues: disk I/O contention and application behavior.

Most commonly this impacts Terminal Servers.  You have many applications running and many user sessions.  Most folks don't realize that client applications are constantly caching to the local disk, while server administrators generally look only at the default parameters of memory, page file, disk I/O, etc.  The problem is that application caching is always local and always happening, and it is not picked up by monitoring the page file.  It is generally quick on your PC, but not on a Terminal Server.  So, with any latency at all in the local disk or the storage, you see a big impact very quickly.
This is discussed at length here: http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/58d8b4ba-860b-4dd2-ba66-42beddec2173/

Networking

The most common issue that folks run into here is something called TCP Task Offloading.

Many NIC manufacturers include Task Offloading features in their drivers, such as Checksum Offload, Large Send Offload, and other features.  These are designed to remove processing overhead from the operating system by handing the work directly to the network card.  My memory places these features back with the introduction of the 3Com Etherlink III network card, and that is also when folks began disabling these features for certain types of application servers.

The symptom is that network throughput is slow, and there might be disconnects.  If you observe the network traffic using a packet sniffer you will notice the traffic is not constant (with an RDP session, SQL connection, or file copy) - instead it is choppy or bursty, and there will be a high number of ARPs.
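If you export packet timestamps and sizes from your sniffer, you can put a rough number on "choppy or bursty".  The sketch below buckets bytes per interval and reports the coefficient of variation (standard deviation divided by mean).  The input format, the function names, and the sample data are assumptions for illustration, not the output of any particular capture tool.

```python
from statistics import mean, pstdev


def bytes_per_interval(packets, interval_seconds=1.0):
    """packets: list of (timestamp_seconds, size_bytes) tuples.

    Returns total bytes seen in each interval, including empty intervals.
    """
    if not packets:
        return []
    start = min(t for t, _ in packets)
    end = max(t for t, _ in packets)
    buckets = [0] * (int((end - start) // interval_seconds) + 1)
    for t, size in packets:
        buckets[int((t - start) // interval_seconds)] += size
    return buckets


def burstiness(packets, interval_seconds=1.0):
    """Coefficient of variation of bytes per interval (0.0 = perfectly steady)."""
    totals = bytes_per_interval(packets, interval_seconds)
    if not totals or mean(totals) == 0:
        return 0.0
    return pstdev(totals) / mean(totals)


# Made-up samples: a steady transfer and one that stalls for two seconds.
steady = [(0.0, 1000), (1.0, 1000), (2.0, 1000), (3.0, 1000)]
bursty = [(0.0, 4000), (3.0, 4000)]
print(burstiness(steady), burstiness(bursty))
```

A value near zero means the traffic is steady; the higher the value, the choppier the transfer.  Comparing the number before and after a settings change is more useful than any absolute threshold.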

Many folks have disabled TCP Task Offload settings for years, with good results, on application servers such as Remote Desktop / XenApp, Exchange, SQL, File Servers, etc.

A note on terminology: the setting in question is TCP Task Offload.  That is different from TCP Offload, so I am making sure I call it out.  It is also different from TCP Chimney (Chimney is "TCP Offload" but not "TCP Task Offload").

If this is the issue, method 3 in this article always works:  http://support.microsoft.com/kb/888750

In the end your throughput is limited by your physical layer: NIC, port, switch.  The protocol you use also has an impact.  For example, iSCSI should have considerably higher throughput than SMB, and NFS would also have higher throughput than SMB.  In turn, SMB2 is higher than SMB.

If you have tried all three types of virtual networks, then your test must be VM to VM, on the same host.  This is where network throughput depends on lots of things.

BTW - if you are going VM to Management OS, there is a big performance hit when the management network is shared with an External Virtual Network.  This hit has gotten smaller over time, but it still exists; it used to be 40%.

Back to the VM to VM test - if you are VM to VM then you can be impacted by storage I/O, especially if the VMs are on the same physical volume.  One reading from and the other writing to can cause what appears to be a network issue, but is really a storage issue.
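One way to separate the two is to measure local disk throughput inside each VM with no network involved.  The following is a minimal sketch, not a benchmark: it writes a small temporary file, forces it to disk, and reports MB per second.  The function name and the default sizes are arbitrary choices for this example, and caching in the storage stack can still flatter the result.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=16, block_kb=64):
    """Write total_mb of random data under `directory`; return MB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually reaches the disk
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        if os.path.exists(path):
            os.remove(path)
    return total_mb / elapsed


rate = measure_write_throughput(tempfile.gettempdir(), total_mb=4)
print(f"{rate:.1f} MB/s")
```

If this number drops sharply while the other VM is reading from or writing to the same physical volume, the slow "network" copy is more likely a storage problem.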
