SQL11UPD01-TSCRIPT-07
This wiki is a transcript of a previously recorded video.
Related content assets:
Welcome to introducing SQL Server 2012 AlwaysOn.
In this video we’ll take an introduction into SQL Server 2012 AlwaysOn Failover Clustering. My name is Jonathan Kehayias. I’m a SQL Server MVP and a Microsoft Certified Master for SQL Server 2008. I’m also a Principal Consultant for SQLskills.
In SQL Server 2012 we get an improved high availability and disaster recovery story based on the implementation of our SQL Server 2012 AlwaysOn Failover Cluster instances for SQL Server.
If our business requirements require that we have instance-level high availability and we have integrated disaster recovery across data centers with minimized downtimes associated with running routine maintenance operations in our environment, in SQL Server 2008 R2 and previous multi-site clustering required that we had a stretched vLAN for SQL Server to be able to function inside of the failover cluster. The reason for this limitation was that the IP address was only an AND dependency within the Windows Server Failover Cluster configuration.
Our cluster failover, in SQL Server 2008 R2 and previous, was based on two checks - IsAlive and LooksAlive – with IsAlive performing a check by simply connecting to the instance and running SELECT @@SERVERNAME and by running that check we could get false failovers that occurred inside of our environment or we could get scenarios where we have a resource issue that’s degrading performance but failover was not going to actually occur because the IsAlive check running SELECT @@SERVERNAME actually returned a value back to the Windows Server Failover Clustering that was acceptable to allow the failover cluster to continue to work, even though our performance was degraded and our user queries may not have been able to be executed inside of the environment.
In SQL Server 2012 multi-site clustering works across different subnets, and the way this was implemented was by allowing the IP address to be an AND/OR dependency inside the Windows Server Failover Cluster configuration.
We also have what’s called a flexible failover policy to allow detailed checking of the instance’s health to provide predictable failover for our failover cluster, including scenarios where we have resource contention issues inside of the environment.
Additionally, installation in our environment can be performed on Windows Server Core which minimizes the patching requirements associated with the operating system. So this improves our uptime associated with downtimes for maintenance because we have less maintenance time to take apply Windows OS patches under Windows Server Core.
With AlwaysOn failover clustering there are a lot of enhancements – the key ones being multi-site clustering across subnets, our flexible failover policy, which is implemented through an improved system diagnostics stored-procedure called sp_server_diagnostics.
We also get support for running SQL Server databases natively on network-attached storage using SMB and this is previously been something that was subject to serious performance implications – however, updates inside of the SMB stack, including changes that occurred in SMB 2.1 and 2.2 in Windows Server 2008 R2 have increased the performance to be able to run a database against a network-attached storage device by a factor of more than 3-fold.
One of the benchmarks that’s been used to test this is the TPCC benchmark and previously, comparing network-attached storage to direct-attached storage, they were able to achieve somewhere around 25% of the performance or throughput of the TPCC benchmark. Under the SMB 2.2 stack in Windows Server 2008 R2, after the fixes have been implemented to improve performance, they’ve been able to achieve as much as 95 to 97% of the same throughput characteristics and workload than they could against direct-attached storage for the same TPCC benchmarks.
The last thing you get with failover clustering enhancement in 2012 is the ability to run tempdb on a local drive and this is something that’s been asked for repeatedly over the years, especially as solid-state media has become more mainstream inside of the environment and we can get a ton of I/O by running a solid-state drive on a PCI-X card inside of our failover cluster node. For tempdb, because it’s recreated at startup every time the instance fails over, it didn’t actually need to be stored on our SAN. So that’s a lot of traffic that’s going across our fiber or our iSCSI connections to our SAN that wasn’t totally necessary. That can now be done natively inside of SQL Server 2012 – tempdb can now be created on a local drive, keeping that activity away from our SAN.
For multi-site clustering we get protection against a single-site failure inside of our environment. Typically when we’re talking about setting up a failover cluster across… with a geo-secondary or geo-clustering in our environment, we’re looking at using 3rd-party replication of the SAN storage. It’s typically going to be hardware, but it could be software-based, replication that’s going to move our data from our primary data center to our secondary data center in near real time.
AlwaysOn failover cluster instances in 2012 support multi-site clustering without requiring any kind of stretched vLAN, so we have a much simpler network configuration inside of our environment for running our failover cluster instance.
Each site can have its own separate IP subnet and when failover occurs, the DNS entry inside of the DNS servers will actually get updated to reflect the current IP address that’s associated with whatever subnet the failover cluster is running in at the time that the instance starts up.
Now this is accomplished because in Windows Server Failover Clustering the IP address is no longer just an AND dependency, it is now an AND/OR dependency. So for the virtual network name to come online it has a dependency on both of the IP subnets and that dependency would have to be defined as an OR dependency – so this IP can come online or this IP can come online – and then the virtual network name can start up and update DNS for us.
Now with that, there are some considerations that have to be made for what the host record time-to-live value in DNS should be to reduce the DNS update latency that we’re going to have associated with our client applications that are trying to connect to our failover cluster. If you go with a typical 15 minute time-to-live, it could take 15 minutes for the clients to realize that the DNS record has changed and then be able to reconnect to our server.
That’s not a really good story. So typically you’re going to want to reduce this to something low – maybe 60 seconds because it could take up to 60 seconds for your failover cluster to actually fail over, get started up, and run through crash recovery. And that keeps the clients from caching that record for longer than the time-to-live record, and they will reconnect to the DNS server to get the new IP address and be able to connect to our failover cluster significantly faster.
Our flexible failover policy provides administrators more granular control over the conditions at which and automatic failover is actually going to happen inside of our failover cluster. In 2008 R2 our LooksAlive check simply had the Windows Server Failover Cluster check the service control manager every 5 seconds and it makes sure the service was running inside of the node that is was currently active on.
Our IsAlive check went a little bit further than that – it would actually connect to the instance and select the server name from the failover instance every 60 seconds and the problem with that is it could return values from that internal function when there are resource issues inside of the failover cluster. And it would have been ideal for it not to have returned values and for the failover… or the cluster to have automatically failed over, restoring access to our user requests inside of the environment.
So in 2012, a new stored procedure has been implemented called sp_server_diagnostics and what it does is it outputs actual health information about the instance – for the system, for the resources, for the query processing, for the I/O subsystem, and for any errors that are occurring inside of the environment.
The IsAlive and LooksAlive checks that were performed by Windows Server Failover Clustering have actually been replaced by sp_server_diagnostics and whether or not a failure occurs is based on what the configured failure condition level is for the instance in Windows Failover Cluster Manager.
Configurable options can also help eliminate the false failovers or facilitate an automated failover when a resource enters an unhealthy state that’s going to impact our environment for performance.
So with our flexible failover policy, there are actually six levels. Three is the default. At level zero there’s never going to be an automatic failover or restart of the instance – it will just log any problems that are occurring.
At level one, failover or restart will only occur if the instance is completely down. Level two – if the instance is up and running but it’s unresponsive, it will force and automatic restart or failover. At level three – if the instance is up and running and it encounters a critical server error (for example, a stack dump) it will force a restart or failover of the instance. At level four, any failure… failover restart will occur for moderate errors (for example, a resource being unhealthy: you have low memory that’s being reported to the dm_os_ring_buffers repeatedly – that can cause a restart or a failover. And at level five, any qualified failure inside of the environment (an example being the query processor being unhealthy, and queries being stalled in the environment) – that can force a restart or failover as well.
So once the user sets the flexible failover policy or failover condition level, inside of the cluster properties for the instance, the Windows Server Failover Cluster will continue to do IsAlive and LooksAlive checks. And the way it does that is it talks to the failover cluster instance resource DLL and requests the IsAlive and LooksAlive information from the resource DLL.
The resource DLL in 2012, instead of doing the traditional IsAlive/LooksAlive checks, it will execute sp_server_diagnostics. And once it executes that it’s going to get information back about the system, the resources, the query processing and subsystem in events inside of SQL Server for our failover cluster instance.
Based on the output that comes back out, in the tabular format, it will evaluate the failover condition level and determine whether the instance is actually in a good healthy state or not. If it’s in a healthy state, it returns the appropriate IsAlive and LooksAlive results to the Windows Server Failover Cluster service, allowing the failover cluster service to determine whether it needs to failover or restart the instance or allow it to continue running inside of the host that it’s currently running in or the node that it’s currently running on.
For reduced planned downtime, running on Windows Server Core can reduce the patching of the OS by as much as 50 to 60%. Recently there has been a period of nearly six or seven months where Windows Server Core as an OS did not have a critical update required with it – which means you no longer have to have Patch Tuesday or Patch Thursday in your environment for applying service patches every month to the servers in your environment. You really close that window down and give your environment a better high availability and better uptime solution – based on not having to apply critical patches to the OS repeatedly in the environment.
You still get support for rolling upgrade of SQL Server, so if you wanted to apply a service pack, you can apply it to one of the non –active nodes inside of the environment, for the instance and then fail the instance over to that node and then apply it to the node that you just removed the instance from and basically roll the upgrade across the environment – keeping your instance live with minimal amount of time associated with failover from one node to the other.
You also get faster failover time for both Availability Groups and failover cluster instances inside of SQL Server 2012.
So in summary, in SQL Server 20122 we have better high availability and disaster recovery through the implementation of multi-site clustering across subnets for simplified disaster recovery configurations across out data centers.
Our flexible failover policy allows us to improve the predictability of our instance failover, and get instance failover when we have resource issues in the environment that are preventing optimal performance. And we can get reduced downtime for routine maintenance by using the Windows Server Core installation of SQL Server to reduce our OS patching requirements inside the environment.
Return to SQL Server 2012 Developer Training Kit BOM (en-US)
Ed Price - MSFT edited Revision 1. Comment: title and tags
Fernando Lugão Veltem edited Original. Comment: added tag