As many of you did, I watched today's Cloud Infrastructure Forum and the release of vSphere 5. I was very excited by many of the features, such as Storage Profiling, Storage DRS, and VMFS 5, and they have blown the top off the resource limits on VMs to create Monster VMs, just to mention a few. However, one topic I notice causing quite a stir is the new licensing, which was mentioned only very briefly at the end of the webinar. To quote VMware on page 3 of the vSphere 5 licensing guide:
vSphere 5.0 will be licensed on a per-processor basis with a vRAM entitlement. Each vSphere 5.0 CPU license will entitle the purchaser to a specific amount of vRAM, or memory configured to virtual machines. The vRAM entitlement can be pooled across a vSphere environment to enable a true cloud or utility based IT consumption model. Just like VMware technology offers customers an evolutionary path from the traditional datacenter to cloud infrastructure, the vSphere 5.0 licensing model allows customers to evolve to a cloud-like “pay for consumption” model without disrupting established purchasing, deployment and license-management practices and processes.
This caused quite an uproar on Twitter, with people complaining that it would raise their licensing costs. My personal opinion on the new licensing is both negative and positive. For every negative I see in something, I always try to put a positive spin on it. First, it is true that this may force some highly consolidated shops to reassess their infrastructure before they upgrade to vSphere 5. It may require purchasing more licenses to obtain enough pooled vRAM to stay on the legal side of the licensing. It may also slow adoption, as people have to audit their infrastructure to determine what the new licensing model will require. For some of the big, memory-packed beast servers this may also prove to be a disadvantage. As I understand the vSphere 5 licensing guide, there is no hard limit for any edition except Essentials: vSphere will not stop you from deploying VMs, whereas Essentials actually enforces a hard limit.
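To make that kind of audit concrete, here is a rough Python sketch of the arithmetic as I read the quoted licensing text: you need at least one license per CPU, and enough licenses for the pooled vRAM entitlement to cover the memory configured to your VMs. The function name and the 30 GB entitlement are placeholders of mine, not figures from VMware; take the real entitlement for your edition from the licensing guide.

```python
import math


def licenses_needed(vm_memory_gb, cpu_sockets, vram_per_license_gb):
    """Estimate CPU licenses for a pool under a per-CPU plus vRAM model.

    vm_memory_gb: memory configured to each VM that counts toward vRAM.
    cpu_sockets: physical CPUs in the pool (one license each, minimum).
    vram_per_license_gb: entitlement per license (placeholder value here).
    """
    if vram_per_license_gb <= 0:
        raise ValueError("vram_per_license_gb must be positive")
    total_vram = sum(vm_memory_gb)
    # Licenses needed so the pooled entitlement covers the configured vRAM.
    by_vram = math.ceil(total_vram / vram_per_license_gb)
    # Every CPU still needs its own license, whatever the vRAM says.
    return max(cpu_sockets, by_vram)


# Hypothetical pool: 4 sockets, 140 GB configured, made-up 30 GB entitlement.
needed = licenses_needed([16, 32, 64, 8, 20], cpu_sockets=4, vram_per_license_gb=30)
print(needed)
```

In this made-up pool the vRAM, not the socket count, decides the license count, which is exactly the situation highly consolidated shops are worried about.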
On a positive note, as a vSphere admin this licensing may make my life easier. When application owners realize that there is a charge based on memory use, and that they may need to sign a purchase order to get their oversized machine approved instead of making their application more efficient, they may change their tune a bit. This means less VM sprawl and more focus on what exactly is running in the environment and whether it is running as efficiently as it can. Also, if there is a zombie VM consuming some valuable vRAM, I am sure it will be found and dispatched more quickly than under the current licensing model.
Let me start this post out with a little story. I am normally a hardcore virtualization and storage guy. Sometimes my career in this sector has me working with things I haven't worked with before, because virtualization encompasses so much. As I continue to work with other teams, I learn more and more about what they do every day. These days I usually find myself involved in every performance troubleshooting session and every new project. My personal philosophy is that the IT guy of the future will be truly converged, just as all the technologies are converging into one box or “stack”. Specialties in smaller subsets will fall away, and a broader specialization in everything datacenter may become the norm.
Early Monday morning, while drinking some coffee and reviewing my brand new vSphere design, I overheard a conversation about connection issues with our new Exchange 2010 environment. I didn't think much about it until my boss came to my desk and asked me to have a look at the problem. Our messaging guy was on vacation, and I was the only other person on staff with some messaging experience. It seemed that all of our global and even local offices were complaining about random Exchange disconnections and about email delivery delays of anywhere from 30 minutes to 4 hours! ActiveSync devices and OWA users did not seem to be affected by these delays at all. Always up for learning new stuff, I took the challenge.
First, let's start with the quick facts I could put together. Users in every country where we have offices were complaining about the random disconnections and delays. I had one case confirmed in China, but had some trouble getting exact user names from the local IT person. Connections were also randomly dropping, with the Outlook client showing a disconnected status in its lower right-hand corner. I did not have any confirmation of exactly who was having these problems. To start, I dug through the event logs on all the servers in the Exchange 2010 environment, and the number of errors I found was overwhelming. To keep this short and not write a novel: nearly all of the errors I investigated were directly related to running Exchange 2010 SP1 without any update rollups in place. Corresponding Microsoft KB articles confirmed that these were fixed in various update rollups.
One thing that stuck out was Event ID 2915 on our CAS servers. Several EWS and RPC connections were reporting “Session Limit Over Budget”. I correlated this with the Default Throttling Policy that Exchange 2010 uses. It seems that the more mailboxes a user opens, the more connections Exchange creates, and nothing trims those connections back. To understand more about the Default Throttling Policy, see Understanding Client Throttling Policies. So I quickly whipped up a PowerShell script that set the Throttling Policy defaults to null so there were no restrictions (funnily enough, Microsoft suggests doing exactly this as a workaround if you encounter the issue).
If you are interested in seeing this script or want me to go deeper into Throttling Policies, contact me, but that isn't really what this article is about, so I will move on quickly.
After the Throttling Policy was changed, the reported disconnections stopped, but the delivery delays continued all around the globe. With the other problem out of the way, I began to realize that this one seemed very random. Some users experienced it, some did not, and some couldn't tell me whether they had experienced it or not. This is when the hours of fruitlessly digging through configurations to learn them, and of reading about Exchange 2010 on Google, began. I noticed our mailbox servers were set up in an active-active configuration with bidirectional replication using DAGs. This is when I decided to go back to the basics of troubleshooting. I went over to the colleague sitting next to me and sent him various test messages. All of them were delivered promptly without any problems. I noted which server his mailbox was running on and moved on. Then I walked around the IT department until I found a colleague who confirmed delivery delays of up to 4 hours. Just for kicks I turned off cached mode on their Outlook client, and the problem magically vanished. Then I turned cached mode back on and left it broken, since I was determined to fix the problem on the server side and not just band-aid it. When I went back to my desk and saw which mailbox server hosted the mailbox of the colleague with the delays, a light bulb went on and everything seemed to come together. Now all I had to do was note the differences between the 2 servers.
First of all, to stop the global issue from occurring while I worked on the problem, I failed all the DAG volumes over to the one server that did not seem to be affected. Reports quickly came in that the problem was resolved. Then I moved on to examining the differences between the 2 servers. After comparing Windows updates between the boxes, I noticed that some updates from February had recently been applied to both servers; however, there was one difference. Microsoft KB2393802 had been applied to one server but not the other. I googled this but found only one vague mention of delays in Exchange 2010 mail delivery, in the middle of a TechNet article relating to this patch, and nothing official at all from Microsoft. I removed the patch, rebooted, and tested with a test mailbox database I had created on that server for this purpose. The problem was fixed, as I had expected.
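If you ever need to do the same comparison, it boils down to a set difference between the two lists of installed updates. Here is a minimal Python sketch of that idea; the function name is mine, the lists are assumed to have been exported from each server by whatever means you prefer, and every KB number other than KB2393802 is a made-up placeholder.

```python
def diff_updates(server_a, server_b):
    """Return (only_on_a, only_on_b) for two lists of installed update IDs."""
    # Normalize case and stray whitespace so "kb123 " and "KB123" match.
    a = {kb.strip().upper() for kb in server_a}
    b = {kb.strip().upper() for kb in server_b}
    return sorted(a - b), sorted(b - a)


# Placeholder inventories; only KB2393802 comes from the story above.
affected_server = ["KB0000001", "KB0000002", "KB2393802"]
healthy_server = ["kb0000001", "KB0000002 "]

only_affected, only_healthy = diff_updates(affected_server, healthy_server)
print("Only on affected server:", only_affected)
print("Only on healthy server:", only_healthy)
```

Anything that shows up on only one side is a candidate for closer inspection, which is how the single differing patch stood out here.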
I tried to research what it is about this patch that could cause the problem, but came up with nothing. If any of you readers have an idea, please comment and let me know your thoughts! I have attempted to contact Microsoft about this issue so they could possibly add it to the KB article, but so far they have not replied.
On the last day of Tech Field Day #6, the other delegates and I were lucky enough to get a sneak peek at stealth startup ‘Zerto‘. We weren't allowed to talk about it until the 22nd, and I know I am a little slow to the punch, but so far I haven't seen a lot of coverage. As an initial disclosure: my trip to Tech Field Day 6 was paid for by the vendors we visited; however, I am in no way obligated to write about them or publicize them in any manner.
Zerto is an Israeli and US based company founded by Ziv and Oded Kedem. They are doing very interesting things in the BC/DR space for virtualization in the enterprise and cloud sectors. They promise host-based, storage-agnostic replication and complete vCenter integration. Another nice feature is VM and VMDK consistency grouping, meaning it is built for vSphere environments and replicates at the VM/VMDK level. When I pressed a little to see how it is done, it turned out that it doesn't use the vStorage APIs at all; instead it uses a vApp per host and a driver loaded directly into the hypervisor. That would mean it goes much deeper than Changed Block Tracking to determine incremental changes: it actually looks at the data coming through the vSCSI stack.
It works similarly to a lot of current enterprise replication products in that it splits the IO as reads and writes come through; however, instead of putting it into array cache, it puts it into memory, since it is working directly in the hypervisor. Credit to @gabvirtualworld, who mentioned in his post, which goes a bit deeper, that it uses the VMware IOVP API to complete this task.
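For anyone who hasn't worked with IO splitting before, here is a toy Python sketch of the general idea as I understand it: every write is applied to the primary disk and a copy is queued in memory to be shipped to the replica later. This is purely my own illustration; the class and method names are made up, and it says nothing about how Zerto actually implements it.

```python
from collections import deque


class SplittingDisk:
    """Toy write splitter: writes hit the primary and an in-memory queue."""

    def __init__(self, size_blocks):
        self.blocks = [b""] * size_blocks  # stands in for the primary disk
        self.pending = deque()             # in-memory copies awaiting replication

    def write(self, block_no, data):
        self.blocks[block_no] = data           # normal write path
        self.pending.append((block_no, data))  # split copy for the replica

    def drain(self, replica):
        """Apply queued writes to the replica in order; return how many shipped."""
        shipped = 0
        while self.pending:
            block_no, data = self.pending.popleft()
            replica[block_no] = data
            shipped += 1
        return shipped


disk = SplittingDisk(size_blocks=4)
disk.write(1, b"first")
disk.write(1, b"second")

replica = {}  # stands in for the disk at the recovery site
print(disk.drain(replica), replica)
```

Because the queued writes are replayed in the order they arrived, the replica ends up with the same final contents as the primary without the storage array ever being involved.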
Zerto boasts application protection policies and built-in support for VSS to attain better application consistency on the other side. This would be useful, for example, with virtualized Exchange environments and running databases. The feature I really like is replication from RDM to VMDK or the other way around. This would be really useful if you were moving datacenters and wanted to change some things in your storage configuration during the initial replication stage. I also really like the ability to create checkpoints/bookmarks on your replicated VMs at different points in time, in case a corrupted VM or inconsistent data has been replicated and you need to go back in time to resolve it (this is similar to the RecoverPoint technology). See the video below for a quick explanation of their product:
Being kind of an old-school FC network guy and a big user of array-specific replication products like SRDF and RecoverPoint (the founders of the company actually created the RecoverPoint technology and sold it to EMC), I am still very curious to see the speed and resiliency of the replication. For instance, would the built-in compression and WAN optimization be enough for a massive 100TB+ environment, and how would it handle the initial synchronization?
Would a product such as Riverbed Steelhead, or any other WAN optimization product, be able to increase the replication efficiency? It will be very interesting to see over time what third-party partnerships and certifications they develop to improve the usability and maturity of their product.