In the previous article, Setting Up and Configuring SQL Server Replication, we discussed in-depth, the SQL Server Replication concept, its components, types and how to configure the SQL Transactional Replication step by step. It is highly recommended to go through the previous article and understand the replication concept and its components before reading this article. In this article, we will see how to troubleshoot an existing SQL Server Replication site.
The main goal of the SQL Server Replication is keeping the data in the Publisher and the Subscriber synchronized. In the happy scenario, if a transaction is performed and committed at the publication database, it will be copied to the distribution database then synchronized and applied to all Subscribers connected to that Publisher. If an issue occurs at any step of this process, the Publisher changes will not be available at the Subscriber side. In this case, we need to troubleshoot and fix that issue as soon as possible before ending up with an expired SQL Replication site that should be synchronized again from scratch or a database with its transaction log file runs out of free space, pausing all database transactions.
Identifying at which step the replication synchronization is failing and allocating an indicative error message that leads to fix the issue, is the most challenging part of the SQL Server Replication troubleshooting process. Also, checking the last synchronization time and what changes performed at/after that time that may cause this failure, can also help in troubleshooting the replication synchronization failure.
Understanding the role of the SQL Server Replication agent will help in identifying at which step the synchronization fails. Recall that there are three replication agents that are common between most of the SQL Server Replication types. The Snapshot Agent is responsible for creating the initial synchronization snapshot. The Log Reader Agent is responsible for reading the changes from the database transaction log file and copy it to the distribution database and finally, the Distribution agent that is responsible for synchronizing the changes to the Subscribers.
In this article, we will take advantage of the Replication Monitor and Job Activity Monitor windows in monitoring the SQL Server Replication status and getting information about any synchronization failure error.
The best and straight-forward way to understand how to troubleshoot the SQL Server Replication issues is by providing practical scenarios and showing how to fix this particular issue. Let us start discussing the scenarios one by one.
SQL Server Agent Service Issue
The SQL Server Agent service plays a vital role in the SQL Server Replication synchronization process. This is due to the fact that each replication agent will run under a SQL agent job.
Being a proactive database administrator, you need to check the SQL replication site status on a daily basis. To check the replication site status, right-click on the Publication, under the Replication -> Local Publications node, and choose the Launch Replication Monitor option, as shown below:
From the Replication Monitor window, you can see a warning message, showing that the replication will be expiring soon or already expired, without seeing any indicative error message, as below:
If the Replication Monitor window provides us with no useful information about why the replication site is expiring soon, the next step is to check the Job Activity Monitor under the SQL Server Agent node. Visiting the SQL Server Agent node, you will see directly that the SQL Server Agent Service is not running (from the red circle beside it). If the SQL Server Agent Service is not running, this means that all the jobs created under that instance are not working, including the replication agent jobs. As a result, the overall replication site is not working.
To fix that issue, we need to start the SQL Server Agent service from the SQL Server Management Studio directly or using the SQL Server Configuration Manager (recommended), as shown below:
After starting the SQL Server Agent service, check the Replication Monitor again and make sure that the Subscriber status is Running and all the pending transactions are synchronized with the Subscriber successfully. You can check these steps one by one, by checking that the records are copied from the Publisher to Distributor section:
Then synchronized from the Distributor to the Subscriber successfully, as below:
And finally make sure that there is no undistributed transaction from the last tab, as shown below:
After that, we need to make sure that the replication agents jobs are up and running with no issue. The SQL Agent jobs can be checked by expanding the SQL Server Agent node under the SSMS Object Explorer and view the Job Activity monitor then check if the Log Reader Agent and Distributor agent are running, taking into consideration that the Snapshot Agent will work only during the snapshot creation process, as shown below:
You can also review the history of the replication agents jobs and check the previous failure reason, by right-clicking on that job and choose View History option as below:
Where you may find an indicative error message that helps in overcoming this issue in the future, as below:
To overcome the previous issue, the SQL Server Agent service startup mode should be changed from Manual to Automatic, in this way you will make sure that the service will start automatically when the hosting server is rebooted.
Snapshot Agent Permission Issue
Assume that while checking the SQL Server Replication status, using the Replication Monitor, you noticed that there is a replication failure, from the X sign inside the red circle. And the Replication Monitor shows that the failure is from one of the replication agents, from the X sign inside the red circle at the top of the Agents tab.
To identify that replication failure, we should browse the Agents tab and check which agent is failing. From the Agents page, you will see that the Snapshot Agent is the failing one. Double-click on the Snapshot Agent and review the below error message:
The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.
Unfortunately, this error message is generic and it shows only that the Snapshot Agent is not working without specifying the reason, as follows:
Then we need to search for useful information in another place, which is the Snapshot Agent job. From the Job Activity Monitor window, under the SQL Server Agent node, you can see that the Snapshot Agent job is failed. And from that job history, you can see that it failed recently, due to the proxy authentication problem. In other words, the credentials for the account under which the Snapshot Agent runs is not correct, as shown below:
To fix the Snapshot Agent credential issue, right-click on the Publication, under the Replication node -> Local Publication, and choose the Properties option. From the Publication Properties window, browse the Agent Security page and re-insert the credentials for the account under which the Snapshot Agent will run.
After refreshing the Snapshot Agent account credentials, start the Snapshot Agent job again, from the Job Activity Monitor window, and make sure that the job is working fine, as below:
Also, check if the Snapshot Agent is working fine now, and the error message does not appear anymore under the Replication Monitor, as shown below:
Snapshot Folder Permission Issue
Assume that, when trying to synchronize the Publisher and the Subscriber using the initial snapshot or resynchronize the Snapshot replication site using a new snapshot, the snapshot creation process failed with the access error message below:
This error message shows that, the account under which the Snapshot Agent is running does not have permission to access the snapshot folder specified in the error message.
To fix that issue, we need to check the account under which the Snapshot Agent is running, from the Agent Security page of the Publication Properties window, as shown below:
Then browse the snapshot folder specified in the error message and make sure that this Snapshot account has minimum read-write permission on that folder, then run the Snapshot Agent again and see that the issue is fixed now and the synchronization snapshot is created successfully, as below:
Subscriber Permission Issue
Assume that, while checking the SQL Server Replication site status, using the Replication Monitor, you see that there is a failure with the Subscriber, as shown below:
If you click on the error icon, you will see that the failure has occurred when trying to synchronize the transactions from the Distributor to the Subscriber. And from the error message, it is clear that the Distributor is not able to connect to the Subscriber SQL Server instance due to permission issue, as shown below:
To fix that issue, we need to check and refresh the credentials used to connect to the Subscriber instance. To check the credentials, right-click on the Subscription under the Replication node -> Local Publications -> the current Publication name and choose the Properties option. From the Subscriber Connection field under the Subscriber Properties window, refresh the credentials for the account that will be used to connect to the Subscriber instance, as shown below:
After that, check the replication status again from the Replication Monitor and you will see that the Subscriber connection issue is no longer available, and the replication site is running normally, as shown below:
Subscriber Not Reachable
Another SQL Server Replication failure issue you may face from the Subscriber side is that the Distributor is not able to connect to the Subscriber, showing under the Distributor to the Subscriber page that, it is not able to open connection with the Subscriber due to “Network Related … ” connectivity error, shown in the Replication Monitor window below:
This error message is indicating that there is a connection issue between the Distributor instance and the Subscriber instance. The first and straight-forward way to check this connectivity issue is to make sure that the Subscriber SQL Server instance is online. This can be checked from the SQL Server Configuration Manager from the Subscriber side. In our situation, we can see that the SQL Server Service at the Subscriber side is stopped. To fix that issue, start the SQL Server Service and check from the Replication Monitor that the replication site is synchronized again, as shown below. For more advanced SQL connectivity issue, check the Troubleshooting Connectivity MS document:
Subscriber Database Permission Issue
Assume that you are checking the SQL Server Replication synchronization status, using the Replication Monitor, and it is found that the replication is failing while trying to replicate the changes from the Distributor to the Subscriber, Clicking on the subscriber error, you will see that the Distributor is able to reach the subscriber and connect to it, but not able to connect to the Subscription database due to lack of permission issue, as shown below:
To fix that issue, connect to the Subscriber and make sure that the account that is used to connect to the Subscriber database is a member of the db_Owner database fixed role, as shown below:
After that, check the Replication Monitor again and make sure that the Distributor is able to reach the subscription database and replicate the changes, as below:
Data Difference Issue
Assume that one of the database development teams claims that there are some changes that are performed on the Shifts table on the Publisher (SQL1) are not reflected in the daily reports that run on the Subscriber instance (SQL2), and he provided the snapshot below that shows that the changes are not replicated:
The first step in checking the replication synchronization issue is opening the Replication Monitor and find at which step it is failing. From the Replication Monitor, you can see that the Log Reader Agent is failing, as the changes are not replicated from the Distributor to the Subscriber, but no clear message is returned from that agent, as shown below:
As we cannot find meaningful error message from the Replication Monitor, we will check the history of the Log Reader Agent job, using the Job Activity Monitor, which shows that, the credentials for the account under which the Log Reader Agent is running, is incorrect, as shown below:
To fix the Log Reader Agent credentials issue, browse the Agent Security page of the Publication Properties window, and refresh the Log Reader Agent credentials with a valid one, as below:
Checking the Replication Monitor again, you will see that the changes are replicated successfully and that the data is updated with the new shifts changes, as shown below:
Row Not Found at Subscriber
Let us look at the issue from another side. Let’s say, there is a change performed in the shifts table as shown below:
But this change is not replicated to the Subscriber and the overall SQL Server Replication site is failed. From the Replication Monitor, you can see that it is failing while trying to make the change from the Distributor to the Subscriber, and failed due to the fact that it is not able to update that specific record with ID equal to 3, because this record is not available at the Subscriber database table, as shown below:
Checking that record at the Subscriber side (SQL2), you will see that the record is not available, as below:
To overcome this issue, we need to insert that record again to the Subscriber database table and let the Distributor try to update it again, fixing the replication synchronization failure issue, as shown below:
SQL Server provides us with an option to let the replication site continue working even though a data inconsistency issue is found, where you can manually fix this inconsistency issue later. To do so, from the Replication Monitor, right-click on the Subscriber and choose Agent Profile option, as shown below:
From the displayed window, you can update the Log Reader Agent profile and allow it to continue replicating data changes in case there is data inconsistency issue, as shown below:
Uninitialized Subscription Issue
If the replication site is left without monitoring for a long time, and a failure occurred without any fix for more than three days, the replication site will be expired and the Subscription will be marked as uninitialized, waiting to be reinitialized again using a new snapshot. The same scenario can be faced when creating a new Subscription without initializing it, as shown below:
To fix that issue, we should reinitialize that Subscription, by right-clicking on the Subscription under the Replication node -> Local Publications and expand the Publication, then choose the Reinitialize option and mark this Subscription for Initialization and make it ready to receive a new snapshot, as shown below:
If the Subscription status stays Uninitialized after reinitializing it, check the Snapshot Agent job, using the Job Activity Monitor window, and see why it is failing. From the Snapshot Agent job history, you will see that the job failed due to an issue determining the owner of that agent job, as shown below:
To overcome this issue, open the Snapshot Agent job and change the owner of the job to SA or any valid administrator user, and the job will run successfully, as below:
Now you will see that the Subscription status changed to Running, giving that it is waiting for the initial snapshot to start the synchronization process, as shown below:
To generate a new snapshot, right-click on the Publication, under the Replication node-> Local Publications, and select View Snapshot Agent Status option.
From the opened window, click on the Start button to start the snapshot creating process. When the snapshot that contains all the Publisher articles created successfully, open the Replication Monitor again and check the status of the Subscription, where you will see that the snapshot is applied to the Subscriber and synchronized with the Publisher, as shown below:
Publisher Database Owner Issue
Assume also that, when checking the status of the SQL Server Replication site, using the Replication Monitor, the replication site was failed and the failure detected at the Log Reader Agent. Checking the error message returned from that agent, it is found that there is an issue determining the current owner of the Publication database, as shown below:
To fix that issue, we need to update the current publication database owner, by replacing it with a valid database user, using the SP_changedbowner system stored procedure, or simply from the database properties window. After that, run the Log Reader Agent job again, using the Job Activity Monitor window, then validate if the agent issue is no longer available, using the Replication Monitor, as shown below:
In this article, we demonstrated different issues that you may face while using the SQL Server Replication feature to copy data between different sites, and how to fix these issues.
It is highly recommended to keep the SQL Server Engine up to date, with the latest SPs and CUs, so all bugs related to the SQL Server Replication features will be fixed automatically. Lastly, as a proactive SQL Server database administrator, keep an eye on your replication site to fix any issue from the beginning before it becomes larger and harder to fix.