Sign in

    Troubleshooting SQL Server Always On Availability Groups

    By: Ahmad Yaseen

    In this article, we will discuss several issues that you may face while creating, configuring or maintaining an Always on Availability Group site.

    Before going through this article, it is recommended to read the previous article, Setting up and Configuring Always on Availability Group in SQL Server, to be familiar with the Always on Availability Group concept and the New Availability Group wizards shown in this article.

    Always on Availability Group Feature Not Enabled

    Assume that, while trying to create a new Always on Availability Group, from the Always On High Availability node, under the Object Explorer of the SQL Server Management Studio, you faced the error message below:

    The Always On Availability Groups feature must be enabled for server instance 'SQL1' before you can create an availability group on this instance. To enable this feature, open the SQL Server Configuration Manager, select SQL Server Services, right-click on the SQL Server service name, select Properties, and use the Always On Availability Groups tab of the Server Properties dialog. Enabling Always On Availability Groups may require that the server instance is hosted by a Windows Server Failover Cluster (WSFC) node. (Microsoft.SqlServer.Management.HadrTasks)

    TroubleShootingAlwaysOn_image1

    It is clear from the error message that, the AlwaysOn Availability Groups feature should be enabled on each SQL Server instances that participate in the Always on Availability Group site, before creating that site.

    You can easily enable the Always on Availability Group feature, by opening the SQL Server Configuration Manager console, browse the SQL Server Services tab then right-click on the SQL Server Database Engine service and choose the Properties option.

    From the opened SQL Server Properties window, move to the Always on High Availability tab and check the checkbox beside the Enable Always on Availability Group, taking into consideration that this change requires restarting the SQL Server service to take effect, as shown below:

    TroubleShootingAlwaysOn_image2

    Database Pre-requisites Validation Issue

    In the earlier steps of the New Availability Group wizard, you will be asked to specify the database(s) that will participate in the Always on Availability Group. Before adding the database, the database should pass the pre-requisites validation check. Otherwise, the database cannot be selected from the databases lists, as shown in the error message below:

    To be added to an availability group, this database must be set to the full recovery model. Set the Recovery Model database property to Full and perform a full or differential database backup on the database. You will then need to schedule log backups on the database.

    TroubleShootingAlwaysOn_image3

    The message is clear. Where the database should be configured with a Full recovery model and a Full or Differential backup should be performed on that database.

    Also, the wizard warns you to schedule a transaction log backup for that database after changing the recovery model to Full, to truncate the transaction log file automatically and prevent running that transaction log file out of free space.

    To fix that issue, change the database recovery model from Simple to Full, from the Options tab of the database properties window, then take a Full backup from that database, as shown below:

    TroubleShootingAlwaysOn_image4

    Refreshing the Select Databases window, the database status will be changed to Meet Prerequisites, as shown below:

    TroubleShootingAlwaysOn_image5

    Shared Network Location Permission Issue

    While trying to configure an Always on Availability Group site, the validation step of the New Availability Group wizard failed with the error message below:

    The primary server 'SQL1' cannot write to '\\SQL1\AlwaysON\BackupLocDb_dbb55cb4-af89-4ed3-b189-1fcaad42358c.bak'. (Microsoft.SqlServer.Management.HadrModel)

    Backup failed for Server 'SQL1'.  (Microsoft.SqlServer.SmoExtended)

    Cannot open backup device '\\SQL1\AlwaysON\BackupLocDb_dbb55cb4-af89-4ed3-b189-1fcaad42358c.bak'. Operating system error 5(Access is denied.).

    BACKUP DATABASE is terminating abnormally. (.Net SqlClient Data Provider)

    In the Full database and log backup initial synchronization method, a shared folder is required to keep the full backup and transaction log backup files temporarily to restore it to all secondary replicas. If the Primary replica is not able to write the backup files to it, or the secondary replicas are not able to read the backup files from it, the New Availability Group validation process will fail as below:

    TroubleShootingAlwaysOn_image6

    To fix that issue, we need to grant the SQL Server Service account of the Primary and Secondary replicas read and write permission on the shared folder shown in the error message, then re-run the validation process, to make sure that all checks are succeeded, as shown below:

    TroubleShootingAlwaysOn_image7

    Windows Failover Cluster Issue

    Assume that you are checking the status of an existing Always on Availability Group site, and see that:

    • The Primary role is moved from SQL1 instance to SQL2.
    • In SQL2, the databases are in the Synchronized state.
    • In SQL1, the databases are not synchronized.
    • SQL1 is in Resolving state.

    As you can see clearly from the SSMS Object Explorer below: 

    TroubleShootingAlwaysOn_image8

    Checking the SQL Server Error logs in the problematic node, we can see that the Availability Group replica becomes offline and the Availability Group stopped working due to an issue in the Windows Server Failover Cluster, as shown in the errors below:

    • Always On Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
    • Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
    • Always On: The local replica of the availability group 'DemoGroup' is stopping. This is an informational message only. No user action is required.

    TroubleShootingAlwaysOn_image9

    The same thing can be detected from the Windows Server Event Viewer, that shows gradually how the replica changes its state to Resolving state, as below:

    • Always On: The local replica of availability group 'DemoGroup' is preparing to transition to the resolving role. This is an informational message only. No user action is required.
    • The availability group 'DemoGroup' is being asked to stop the lease renewal because the availability group is going offline. This is an informational message only. No user action is required.
    • The state of the local availability replica in availability group 'DemoGroup' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed because the availability group is going offline.  The replica is going offline because the associated availability group has been deleted, or the user has taken the associated availability group offline in Windows Server Failover Clustering (WSFC) management console, or the availability group is failing over to another SQL Server instance.  For more information, see the SQL Server error log or cluster log.  If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.

    TroubleShootingAlwaysOn_image10

    To check the status of the Windows Cluster site, we will use the Failover Cluster Manager to see which part of the Windows Cluster is failing.

    But the Failover Cluster Manager shows that the whole cluster is down, as shown below:

    TroubleShootingAlwaysOn_image11

    The first thing to validate here from the Windows Failover Cluster side is the Cluster Service, that can be checked from the Windows Services console, as below:

    TroubleShootingAlwaysOn_image12

    It is clear from the Services console, that the Cluster Service is not running. To fix that issue, start the service from that console, then refresh the Failover Cluster Manager console to make sure that the Windows Cluster site is up and running, as shown below:

    TroubleShootingAlwaysOn_image13

    Checking the Always on Availability Group again, you will see that the databases are in synchronized again and the Always on Availability Group site is in health state again, as shown below:

    TroubleShootingAlwaysOn_image14

    Transaction Log File is Full at the Primary Side

    Assume that you receive the below error message when trying to execute a new query on one of the Always on Availability Group databases:

    TroubleShootingAlwaysOn_image15

    Checking what is blocking the transaction logs file and prevents it from being truncated, you will see that the transaction log file of this database is pending log backup operation to be truncated, as shown below:

    TroubleShootingAlwaysOn_image16

    Taking a transaction log backup for that database, in case you forget to schedule a transaction log backup job, as follows:

    TroubleShootingAlwaysOn_image17

    And check again what is blocking the transaction log of that database, it is showing in my scenario that, it is waiting for Availability_Replica. Which means that the logs are waiting to be written to the secondary replica, but not able to send these transaction logs to the secondary replicas due to an issue in the Always on Availability Group site, as below:

    TroubleShootingAlwaysOn_image18

    The best location to check and troubleshoot the Always on Availability Group site is the Always on Dashboard, which can be opened by right-clicking on the Availability Group name and choose Show Dashboard option.

    From the dashboard, you can see that the Secondary replica SQL2 is not synchronized with the Primary replica, due to connectivity issue, as shown below:

    TroubleShootingAlwaysOn_image19

    Checking the Secondary replica, and making sure that the SQL Server Service is up and running at the secondary side, as follows:

    TroubleShootingAlwaysOn_image20

    Then refreshing the Availability Group dashboard again, you will see that the Always on Availability Group site is healthy again. Checking if the transaction logs file is blocked by any operation, we will see that it is pending OLDEST_PAGE, indicating that the oldest page of the database is older than the checkpoint LSN. This issue can be fixed easily by taking another transaction log backup and the transaction log file will be blocked by nothing, as shown clearly below:

    TroubleShootingAlwaysOn_image21

    Always on Availability Group Failover Misconfiguration

    Assume that the Primary replica becomes offline due to an unplanned issue. As expected, the system will not be affected as an automatic failover operation will be performed and the secondary replica will act as the new Primary replica.

    But in our case, this happy scenario is not valid, where the secondary replica changed to Resolving state and the system is down!

    TroubleShootingAlwaysOn_image22

    Checking the secondary replica’s error log and see why it is not acting as the new Primary as expected, you will see that it is failing due to a role synchronization issue, as shown below:

    The availability group database "AdventureWorks2017" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

    TroubleShootingAlwaysOn_image23

    This means that there is an issue with the synchronization mode that is used in this Availability Group. The synchronization mode used, can be checked from the Always on Availability Group properties page.

    From the properties page below, it is clear that the Failover mode in this Availability Group is configured to be performed Manually only. In this case, you need to manually perform a failover operation before rebooting or shutting down the server:

    TroubleShootingAlwaysOn_image24

    This can be fixed easily by changing the Failover Mode to Automatic, where an automatic failover operation will be performed in case of any unplanned shutdown or reboot:

    TroubleShootingAlwaysOn_image25

    The same issue can be faced when the Windows Failover Cluster quorum is configured with Node Majority for an even number of replicas, where any failure for one of the servers will bring the Windows Failover Cluster site offline. For more information, check Windows Failover Cluster Quorum Modes in SQL Server Always On Availability Groups:

    TroubleShootingAlwaysOn_image26

    Failover with Data Loss

    Assume that you are trying to perform a manual failover between the Primary and one of the Secondary replicas, but in the Select New Primary Replica window, you see a warning message that the failover operation may end up with data loss as the Primary and the selected Secondary replica are not synchronized, as shown below:

    TroubleShootingAlwaysOn_image27

    To identify the cause of that issue, we will browse the Always on Health events using the Always on Availability Group dashboard, which shows that the Primary replica is not able to open a connection to the Secondary replica, ash shown below:

    TroubleShootingAlwaysOn_image28

    After fixing the connectivity issue between the Primary and the Secondary, refresh the replicas list and you will see that the data loss issue is fixed, as shown below. For more information about troubleshooting the connectivity issues, check  Troubleshoot connecting to the SQL Server Database Engine.

    TroubleShootingAlwaysOn_image29

    Monitoring Always on Availability Group Latency

    The Availability Group dashboard can be modified to include additional columns that provide information about the synchronization latency between Primary and Secondary replicas, including the Commit LSN, Sent LSN and harden LSN values, without showing why there is a latency, as shown below:

    TroubleShootingAlwaysOn_image30

    For more information about measuring the latency, check the Measuring Availability Group synchronization lag.

    Starting from SSMS 17.4, the Always on Availability Group dashboard enhanced to include two new options that are used for latency information calculation, analysis and reporting, which helps in identifying the bottlenecks in the transaction logs flow between the Primary and the Secondary replicas and narrow down the cause of that latency.
    For more information about the new functionality and reports, check to Use the Always on Availability Group dashboard.

    To trigger using this new option, click on Collect Latency Data option from the Always on Availability Group dashboard, that will create a new SQL Agent job on the Primary and Secondary replicas to collect the latency data, As shown below:

    TroubleShootingAlwaysOn_image31

    When the created job execution has completed on all the Availability Group replicas, you will be able to view the latency statistics from the latency reports by right-clicking on the Availability Group name and choose the Primary Replica Latency or Secondary Replica Latency report, based on the replica role in the Availability Group.

    After providing information about the Availability Group replicas, the latency report will show a graphical view of the transaction log commit time on the Primary replica and the remote Hardening time for the secondary replicas, aggregated as average values. Also, the report provides statistical values for the transaction logs send, receive, commit, compress, decompress and other numerical values based on the replica role in the Availability Group.

    For more information about the latency report, check New in SSMS - Always On Availability Group Latency Reports.

    The below report is an example of the latency reports generated from the Secondary replica, showing normal logs transport operations:

    TroubleShootingAlwaysOn_image32

    Also, the Log Block Latency report shows the amount of time, in ms, that the transaction log on the Primary replica waits for Secondary replicas to commit that transaction. After enabling it from the Availability Group Dashboard, you can browse it from the SSMS similar to the previous latency reports. Take into consideration that, the large latency time indicates that the Primary replica is waiting a long time for the Secondary replicas to commit the sent transactions, as shown below:

    TroubleShootingAlwaysOn_image33

     

    April 16, 2020 7:56:00 AM PDT
    Ahmad Yaseen

    Written by Ahmad Yaseen

    Ahmad Yaseen is a SQL Server database administration leader at Aramex International Company with a bachelor’s degree in computer engineering as well as .NET development experience. He is a Microsoft Certified Professional with a good experience in SQL Server development, administration, performance tuning, monitoring and high availability and disaster recovery technologies. Also, he is contributing with his SQL tips in many blogs.

    Topics