Intermittent DTC Cluster Bomb

  • We are experiencing a strange issue with MSDTC on a 4-node SQL Server 2008 R2 Enterprise (SP1) cluster running on Windows Server 2008 R2 Enterprise 64-bit.

    We have this configured with 10 SQL instances/services, and a cluster DTC service. Five of the SQL instances also have their own DTC service within their cluster service.

    What we have been seeing recently is an intermittent DTC error, namely:

    OLE DB provider "SQLNCLI" for linked server "LINK_L" returned message "No transaction is active.".

    Msg 7391, Level 16, State 2, Procedure XXX, Line ##

    The operation could not be performed because OLE DB provider "SQLNCLI" for linked server "LINK_L" was unable to begin a distributed transaction.

    However, rerunning the query a few seconds later results in the query running correctly.

    After a lot of investigation, config changes, netstat, dtcping and dtctest, we have recorded the following observations.

    - We only get this message when running a DTC query on one of the instances that has its own DTC service. The instances using the default cluster DTC service are not affected.

    - The DTC services are all configured the same way.

    - We have seen this behaviour from external servers running SQL 2005 on 32-bit and SQL 2008 R2 on 64-bit, but it doesn't seem to happen when running between instances on the cluster itself.

    - If a DTC connection is already established (seen via netstat), the query succeeds. If no connection is established, the initial call fails with the above error, although a DTC connection is then established. Subsequent calls succeed until the connection times out.
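
    Since the first call fails and a rerun a few seconds later succeeds, one stopgap is to retry on the client side. The following is only a minimal Python sketch of that idea, not a fix for the underlying delay. The names `run_with_dtc_retry` and `is_transient_dtc_error` are hypothetical, and it assumes the database driver surfaces error 7391 (or its message text) in the exception string.

```python
import time

# Assumption: the driver includes the SQL Server error number or message
# text in the exception string. Adjust the markers for your driver.
TRANSIENT_MARKERS = ("7391", "unable to begin a distributed transaction")


def is_transient_dtc_error(exc):
    """Return True if the exception text looks like the intermittent DTC error."""
    text = str(exc).lower()
    return any(marker in text for marker in TRANSIENT_MARKERS)


def run_with_dtc_retry(operation, retries=2, delay_seconds=5.0, sleep=time.sleep):
    """Call operation(); on a DTC begin failure, wait and try again.

    operation is any zero-argument callable that runs the distributed query.
    Errors that do not match the markers are re-raised immediately.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            if attempt >= retries or not is_transient_dtc_error(exc):
                raise
            attempt += 1
            sleep(delay_seconds)  # give the DTC connection time to establish
```

    Retrying is only reasonable here because the error means the distributed transaction never began, so no work should have been committed by the failed attempt.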

    It feels like some kind of hand-shake timeout establishing the DTC connection, but we can’t find any way to configure this, or work out why it only affects calls to the cluster instances running their own DTC service, and not those using the cluster DTC service.

    Has anyone encountered anything similar, or have any suggestions on where we can look next?

    Thanks

    Of course I'm grumpy, I'm a DBA.
    The Grumpy DBA

  • We've now tracked it down to a slow call to the cluster DTC service running the "Get Address" eventsubclass.

    This causes an error on the client: "eventid=TRANSACTION_PROPOGATION_FAILED_CONNECTION_DOWN_FROM_REMOTE_TM".

    We can get round this by increasing the timeout on the bind packet on the client machine, by adding a registry key for CmMaxNumberBindRetries, but it still doesn't explain why it's happening.
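
    For anyone wanting to try the same workaround, here is a small Python sketch that generates a .reg file for the client machine. The thread does not state the key path, value type, or number used, so all three are assumptions: the sketch assumes a DWORD under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC, and the value 15 is purely illustrative. Check these against Microsoft's documentation before importing anything.

```python
# Assumption (not stated in the thread): the value lives under this key
# and is a DWORD. Verify before importing the generated file.
MSDTC_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC"


def build_reg_file(bind_retries):
    """Return .reg file text that sets CmMaxNumberBindRetries."""
    if not 0 <= bind_retries <= 0xFFFFFFFF:
        raise ValueError("bind_retries must fit in a 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{MSDTC_KEY}]",
        # .reg files write DWORD values as eight hex digits
        f'"CmMaxNumberBindRetries"=dword:{bind_retries:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 15 is a hypothetical value, not the one used in this thread.
    print(build_reg_file(15))
```

    Generating the file rather than writing the registry directly leaves a reviewable artifact that can be checked before it is imported on each client.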

    Why would DTC take 10-15 seconds to respond to a client?

    What is DTC doing when a client attempts to establish a session with DTC?

    Is there any way to improve this?

    Of course I'm grumpy, I'm a DBA.
    The Grumpy DBA

  • Did increasing the timeout value fix the DTC problem? Please advise!

    Thanks.

  • @GrumpyDBA, how did you track it down? Using the MSDTC trace log?

    Did increasing the timeout solve the issue for you?
