Transactional replication snapshot makes server very slow

  • Hi, I have a very strange problem that I have been battling with for the last month.

    I have a SQL 2005 server with 7 databases. 4 of them are replicated to a failover server using transactional replication (it must be transactional; mirroring is not an option). Two of the DBs are quite large, at 7GB and 17GB of data; in both, 1 or 2 tables contain the majority of the data. Both databases are very busy, with a high number of transactions and a low tolerance for delays. Occasionally replication falls out of sync and needs to be re-initialized.

    On the 7GB database I have no problems; I can run the snapshot agent at any time of day with no impact on users. On the larger, 17GB database this does not quite work, which, given the 7x24 nature of the application, is not acceptable. When the snapshot agent runs, the server slowly becomes less and less responsive and eventually transactions begin to simply time out. Generation of the snapshot takes almost 2 hours. sqlservr.exe seems to use less and less memory (according to Task Manager), but I do not see where the memory is going, at least not in Task Manager.

    At this point the only choices I have are to kill the snapshot, which restores performance, or to wait and let it finish. When it finishes, the memory seems to be released, sqlservr.exe slowly allocates it back, and performance recovers.

    I have played with the snapshot agent profile, and so far I get the best results with a very small BcpBatchSize of 2000. A larger batch size seems to make the condition appear sooner and more severely, and increasing the number of threads makes the CPU spike higher and also makes things worse.
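
    In case it helps, this is roughly how the profile change can be scripted at the Distributor. It is an untested sketch: the profile id is a placeholder, and the exact stored parameter name is worth verifying against sp_help_agent_parameter first.

    -- Rough sketch, run at the Distributor.
    -- List the snapshot agent profiles to find your custom profile's id:
    EXEC sp_help_agent_profile @agent_type = 1;   -- 1 = Snapshot Agent

    -- Then set the batch size on that profile (15 is a placeholder id):
    EXEC sp_add_agent_parameter
        @profile_id      = 15,
        @parameter_name  = N'BcpBatchSize',   -- verify the exact name with
        @parameter_value = N'2000';           -- sp_help_agent_parameter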

    I do not care how long the snapshot takes, as long as it does not impact production.

    Any help or input is appreciated.

  • Do you have a separate Distributor, and is this a pull or a push subscription?

    Why do you have to reinitialize when there is a good amount of latency? Is replication not catching up?

    To your question about reinitializing: you can reinitialize your subscription through backup and restore. You can find the implementation details in BOL.
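
    The short version of the BOL approach is a subscription created with a sync_type of 'initialize with backup'. A rough sketch with placeholder server, database, and path names (the publication also has to allow it via allow_initialize_from_backup):

    -- Sketch only: restore a recent backup of the publication database at
    -- the subscriber first, then create the subscription at the publisher
    -- pointing at that backup. All names and paths are placeholders.
    EXEC sp_addsubscription
        @publication       = N'BigDB_Pub',
        @subscriber        = N'FAILOVERSRV',
        @destination_db    = N'BigDB',
        @subscription_type = N'push',
        @sync_type         = N'initialize with backup',
        @backupdevicetype  = N'disk',
        @backupdevicename  = N'\\backupshare\BigDB.bak';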

    Hope this helps.

  • I cannot use backup-restore for initialization: some of the tables are not replicated, and the SPs are different as well.

    Occasionally replication just fails with a "cannot find a row at the subscriber..." message. I suppose I could spend time chasing down these missing rows, but I prefer to avoid that.
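
    (When I have chased one down in the past, I have looked up the failing command in the distribution database with something like this sketch - the sequence numbers are placeholders that come from the agent's error details in Replication Monitor:)

    -- Sketch: run in the distribution database at the Distributor.
    USE distribution;
    EXEC sp_browsereplcmds
        @xact_seqno_start = N'0x0000001900001926000400000000',  -- placeholder
        @xact_seqno_end   = N'0x0000001900001926000400000000';  -- placeholder
    -- (the Distribution Agent also has a -SkipErrors parameter, but that
    -- masks the problem rather than fixing it)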

  • Oh, and the Distributor is the same server as the Publisher. It is a push subscription. Replication is over a WAN, coast to coast, over a 10Mbps link.

  • sam rebus (5/3/2009)

    I cannot use backup-restore for initialization: some of the tables are not replicated, and the SPs are different as well.

    You can always drop the tables that are not replicated. Do you mean the SPs related to the articles?

    Occasionally replication just fails with a "cannot find a row at the subscriber..." message. I suppose I could spend time chasing down these missing rows, but I prefer to avoid that.

    You can always figure out why this is happening: somebody or some process is deleting data at the subscriber.

    Also, for this issue, are you trying to reinitialize the entire subscription?

  • sam rebus (5/3/2009)

    Oh, and the Distributor is the same server as the Publisher. It is a push subscription. Replication is over a WAN, coast to coast, over a 10Mbps link.

    Based on the link speed mentioned, I think this is taking a good amount of time and consuming resources as well, since it is dealing with a good amount of data.

  • Well, I wish it were that simple, but the link has nothing to do with it. The problems begin during snapshot creation, well before the snapshot starts being applied to the subscriber. Applying the snapshot actually goes quite well; no problems there.

    I understand that there are workarounds and things that can be done to avoid re-initialization, but I still need to make the snapshot work.

  • The problem is almost certainly occurring because, during the generation of the snapshot, a shared read lock is taken on each table. This is done to ensure that the snapshot is transactionally consistent.

    As the snapshot proceeds, more and more tables are holding the read lock. Practically, this means that for large snapshots your users are eventually locked out. As the number of users who are locked out increases, the load on your server tends to decrease. When you stop the snapshot, the system should very quickly return to normal operation, since there is not much to recover (just a few records in the distribution database).
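
    You can actually watch this happen while the snapshot agent runs. A rough sketch against sys.dm_tran_locks (run it in the published database; the lock-mode filter is approximate):

    -- Sketch: show sessions holding shared object-level locks while the
    -- snapshot agent is running.
    SELECT s.session_id,
           s.program_name,
           OBJECT_NAME(l.resource_associated_entity_id) AS table_name,
           l.request_mode,
           l.request_status
    FROM sys.dm_tran_locks AS l
    JOIN sys.dm_exec_sessions AS s
      ON s.session_id = l.request_session_id
    WHERE l.resource_type = 'OBJECT'
      AND l.resource_database_id = DB_ID()
      AND l.request_mode IN ('S', 'Sch-S');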

    I have worked around this in the past by creating multiple publications and getting each publication synchronised separately. This technique allows you to break the publication up into smaller chunks which, whilst they still use the same read-lock strategy, have less of an impact because each one takes less time.

    The number and content of each publication is up to you to work out; it really depends on the size and number of the tables being published.
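
    As a rough sketch of the idea (all names here are placeholders):

    -- Sketch: split the articles across publications so each snapshot
    -- locks fewer tables for less time.
    EXEC sp_addpublication
        @publication = N'BigDB_Pub_SmallTables',
        @repl_freq   = N'continuous',
        @status      = N'active';

    EXEC sp_addarticle
        @publication   = N'BigDB_Pub_SmallTables',
        @article       = N'Orders',            -- placeholder table
        @source_owner  = N'dbo',
        @source_object = N'Orders';

    -- ...repeat sp_addarticle for the other small tables, then create a
    -- second publication (e.g. BigDB_Pub_BigTable) for the large table.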

  • Brilliant explanation! Thanks, I suspected that there was something along these lines going on, having to do with locking. So what you are saying is: instead of one publication containing all the tables, create multiple publications with a small subset of tables in each? I will be sure to try this, although I am not entirely convinced that it will help with the largest of the tables, which takes most of the time to snapshot. I always thought that table locks were only applied for the duration of each table's snapshot and released once that table's BCP is done; in that case, splitting the publication into multiple subsets of tables should not really help. Am I wrong?

    Is there any way to change the locking mechanism used during the snapshot?

  • I don't think that you can change the locking model used by replication - it is intentionally used to guarantee the transactional integrity of the publication.

    You are correct that if you have a single table that is the largest and takes most of the time in the creation of the snapshot, then you will not see much improvement.

    There are other options that you can explore...

    - make sure that the snapshot folder is on a drive that has no other activity

    - use "native mode" for the snapshot

    - consider initialising the subscription without using a snapshot - see BOL article "Initializing a Transactional Subscription Without a Snapshot" for more information on this.
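
    For that last option, the BOL approach boils down to creating the subscription with a sync_type of 'replication support only' - roughly like this sketch (placeholder names; you are responsible for getting the data to the subscriber yourself beforehand):

    -- Sketch: the subscriber already has the data (copied/restored by
    -- you), so the subscription only sets up the replication metadata.
    EXEC sp_addsubscription
        @publication       = N'BigDB_Pub',
        @subscriber        = N'FAILOVERSRV',
        @destination_db    = N'BigDB',
        @subscription_type = N'push',
        @sync_type         = N'replication support only';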

  • Actually, for the locking of the tables, there is a setting in the publication properties that allows you to specify whether or not you want to lock the tables. If you select this option, it will put a placeholder in for any data that is added; that way users will be able to continue to look at and modify the underlying tables.
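
    On an existing publication I believe this corresponds to the 'concurrent' sync_method - something like this sketch (the publication name is a placeholder, and changing the sync method may require a reinitialisation):

    -- Sketch: the 'concurrent' sync_method only holds the table locks
    -- briefly at the start of snapshot generation instead of for the
    -- whole run.
    EXEC sp_changepublication
        @publication = N'BigDB_Pub',
        @property    = N'sync_method',
        @value       = N'concurrent';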

    Hope this helps.

    Tris
